In his two previous books, Statistical 
Methods for Research Workers, and The 
Design of Experiments, Sir Ronald Fisher 
had in view two practical aims. Firstly, to
facilitate correct statistical processes in the
reduction of numerical observational
material and secondly, to improve the
quality of the numerical data obtained by
experimentation by a deliberate study of the
underlying logic of the process and of the 
available means of achieving its aims. 
In neither book could adequate space be 
given to those refinements of the reasoning 
process which have clearly come into view 
with the study of the two previous aspects of 
scientific work. It is with those refinements 
that this new book is concerned. 
An examination of the logical nature of 
uncertain inference, including therein the 
concept of mathematical probability and 
some other numerical concepts equally 
requisite, is provided to consolidate the gains 
established on the other two fronts. The 
concept of mathematical probability is more 
strictly defined than has often been done, 
and its use in scientific inference correspondingly
restricted. This removes a good deal
of confusion arising from the attempt to
apply the concept uncritically to all cases. 
We use Reason for improving the Sciences;
whereas we ought to use the Sciences for 
improving our Reason. 
ANTOINE ARNAULD, 1662 
(The Port-Royal Logic) 


Another use to be made of this Doctrine of 
Chances is, that it may serve in Conjunction 
with the other parts of the Mathematicks, as a 
fit Introduction to the Art of Reasoning. 

DE MOIVRE, 1718


If logic investigates the general principles of 
valid thought, the study of arguments, to which 
it is rational to attach some weight, is as much 
part of it as the study of those which are
demonstrative.
J. M. Keynes, 1921 
CHAPTER I


FOREWORD 


The [...] output, characteristic of the present
century, of books on various aspects of Statistics, many
of them of a much higher standard than were available
in the past, not all in the scientific field, but
covering the requirements of technological, commercial,
educational and administrative purposes, is a
recent efflorescence following as a natural and
perhaps inevitable consequence on the efforts towards
abstract understanding largely set on foot
long ago by that versatile and somewhat eccentric
man of genius, Francis Galton. Although many of
the fields to which statistical methods and ideas
are now successfully applied, are not primarily
scientific in aim, that is, are not directed specifically
towards an improved understanding of the natural
world, yet the fruitfulness and success of the train of
studies initiated by Galton were, I submit, due to
his own outlook of untrammelled scientific curiosity, and
to his confidence that it was in regard to scientific
problems that a more penetrating statistical methodology
was required. Though all branches of
Science have profited and been rev[...] by his
influence, it is the course of progress achieved on the
scientific front which requires recapitulation, if the
nature of the whole movement is to be grasped in
spite of its growing complexity and diversity.

Galton's great gift lay in his [...] during his life, of the vagueness of many of the
phrases in which men tried to express themselves in
describing natural phenomena. He was before his 
time in his recognition that such vagueness could 
be removed, and a certain precision of thought 
attempted by finding quantitative definitions of 
concepts fit to take the place of such phrases as "the 
average man", "variability", "the strength of in- 
heritance", and so forth, through the assembly of 
objective data, and its systematic examination. 
That the methods he himself used were often 
extremely crude, and sometimes seriously faulty, is, 
indeed, the strongest evidence of the eventual value 
to the progress of science of his unswerving faith 
that objectivity and rationality were accessible, even 
in such elusive fields as psychology, if only a factual 
basis for these qualities were diligently sought. The 
systematic improvement of statistical methods and 
the development of their utility in the study of 
biological variation and inheritance were the aims to 
which he deliberately devoted his personal fortune, 
through the support and endowment of a research 
laboratory under Professor K. Pearson. 

The peculiar mixture of qualities exhibited by
Pearson made this choice in some respects regrettable,
though in others highly successful. Pearson's energy 
was unbounded. In the course of his long life he 
gained the devoted service of a number of able 
assistants, some of whom he did not treat particularly
well. He was prolific in magnificent, or grandiose,
schemes capable of realization perhaps by an army
of industrious robots responsive to a magic wand.
In a sense he undoubtedly appreciated Galton's
conception of the greatness of the potential con- 
tribution of Statistics in the service of Science, and as 
a means of rendering strictly scientific a range of
studies not traditionally included in the Natural 
Sciences, but, as perceived through his eyes, this 
greatness was not easily to be distinguished from the 
greatness of Pearson himself. 

The terrible weakness of his mathematical and 
scientific work flowed from his incapacity in self-criticism,
and his unwillingness to admit the possibility
that he had anything to learn from others, even
in biology, of which he knew very little. His 
mathematics, consequently, though always vigorous, 
were usually clumsy, and often misleading. In 
controversy, to which he was much addicted, he 
constantly showed himself to be without a sense of 
justice. In his dispute with Bateson on the validity 
of Mendelian inheritance he was the bull to a skilful 
matador. His immense personal output of writings, 
his great enterprise in publication, and the excellence 
of production characteristic of the Royal Society 
and the Cambridge Press, left an impressive literature. 
The biological world, for the most part, ignored it,
for it was indeed both pretentious and erratic. Yet 
the intrinsic magnitude of some of the problems 
brought into discussion, the high prestige that 
mathematical writing always carries, and a certain 
imaginative boldness, did suffice to save this material 
from complete neglect. Little as Pearson cared for 
the past—for example, for the Gaussian tradition of 
least square techniques—and much as he would have 
disliked the future of statistical science, his activities 
have a real place in the history of a greater movement. 

Though Pearson did not appreciate it, quantitative 
biology, especially in its agricultural applications, 
was beginning to need accurate tests of significance. 


As early as Darwin's experiments on growth rate the
need was felt for some sort of a test of whether an
apparent effect "might reasonably be due to chance".
At the same time it was recognized that the available
test based on the conventional “probable error” was
not always to be relied on. I have discussed this
particular case in The Design of Experiments (Chap.
III).¹ It was characteristic of the early period, [...]
of Pearson, that such difficulties were habitually
blamed on “paucity of data”, and not ascribed
specifically to the fact that mathematicians had so
far offered no solution which the practitioner could
use, and indeed had not been sufficiently aware of the
difficulty to have discussed the problem. As is well
known, it was a research chemist, W. S. Gosset,
writing under the designation of “Student”,⁴ who


supplied the test which, important as it was in itself, 
was of far 


first stage o 
attained su 


exact solut 
which were th 
being discuss 


distributions, and the tests of significance based upon 
them. The logi 
scientific application, to reinterpret them in terms of
an imagined process of acceptance sampling, such as 
was beginning to be used in commerce; although 
such processes have a logical basis very different 
from those of a scientist engaged in gaining from his 
observations an improved understanding of reality. 

The exact solutions of the series of problems of 
distribution left unsolved by the Pearsonian school 
had not only as their immediate fruit the refinement,
by accurate tests of significance, of the experimenter's 
facility in examining his data critically; at a deeper 
logical level they allowed of the development of 
objective principles of estimation, and so revealed 
the misleading character of many of the methods of 
estimation commonly advocated. The variety of 
concepts relevant to the logical basis of a process of 
estimation of hypothetical quantities by the aid of 
observational material is considerable, and some 
account is given of them in Chapter VI. It was early 
necessary to distinguish Mathematical Likelihood 
from Mathematical Probability, and the concept of 
quantity of information (different from the meaning 
later given to the same phrase in Communication 
Theory), itself intimately related to the Likelihood,
was found to measure effectively the competence of 
any proposed method of estimation, even in its 
application to small samples. 

From a larger viewpoint than that of merely 
refining and perfecting the statistical processes used 
in the examination of a fixed body of data, the concepts
of the theory of estimation lent themselves to the
effectual comparison of different bodies of data, and 
therefore of the experimental procedures, or observa- 
tional programmes capable of giving rise to such 
observational foundations. This is the leading con-
sideration in the branch of statistical science known 
as Experimental Design, on which during the last few 
years very comprehensive and substantial works 
have appeared. The practical and theoretical study 
of Experimental Design, with which should be 
included that of sampling for the purpose of factual 
ascertainment as developed by Mahalanobis,? and by 
Yates? may be regarded as the second great move- 
ment in the development of Statistics for the 
clarification of scientific thought. 

In the author's view, however, there has already 
appeared a need for the exposition and consolidation 
of the specifically logical concepts which have emerged 
as it were as by-products of both (a) the purely 
mathematical elucidation of statistical problems in 
the first phase, and (b) in the second phase the develop- 
ment of experimental designs, logically coherent 
with the processes used in their discussion, and with
the scientific inferences of which they are to supply
the basis, so as to form with them a complete illustration
of the mode in which new scientific knowledge is
generated. Once recognized and applied, there is
little danger of such an advance in procedure being
lost. Practitioners, however, are not the natural
repositories of logical niceties, and teachers especially,
engaged in introducing students to these new fields, 
may value an attempt to consolidate the specifically 
logical gains of the past half-century, and will 
perhaps tolerate a certain amount of necessary 
hair-splitting. 

In the introduction to the Design of Experiments
(p. 9) I have stressed my conviction that the art of
framing cogent experiments, and that of their
statistical analysis, can each only establish its full
significance as parts of a single process of the improvement
of natural knowledge; and that the logical
coherence of this whole is the only full justification 
for free individual thought in the formation of 
opinions applicable to the real world. I would wish 
again now to reiterate this point of view. To one 
brought up in the free intellectual atmosphere of an 
earlier time there is something rather horrifying in 
the ideological movement represented by the doctrine
that reasoning, properly speaking, cannot be applied 
to empirical data to lead to inferences valid in the 
real world. It is undeniable that the intellectual 
freedom that we in the West have taken for granted 
is now successfully denied over a great part of the 
earth's surface. The validity of the logical steps by 
which we can still dare to draw our own conclusions 
cannot therefore, in these days, be too clearly expounded,
or too strongly affirmed.
CHAPTER II


THE EARLY ATTEMPTS AND THEIR 
DIFFICULTIES 


1. Thomas Bayes 


For the first serious attempt known to us to give a 
rational account of the process of scientific inference 
as a means of understanding the real world, in the 
sense in which this term is understood by experi- 
mental investigators, we must look back over two 
hundred years to an English clergyman, the Reverend 
Thomas Bayes, whose life spanned the first half of the 
eighteenth century. It is indeed only in the present 
century, with the rapid expansion of those studies 
which are collectively known as Statistics, that the 
importance of Bayes’ contribution has come to be 
appreciated. The Dictionary of National Biography," 
representing opinion current in the last quarter of the
nineteenth century, does not include his name. The
omission is the more striking since this work of
reference does include a notice of his father, Joshua
Bayes (1671-1746). While the father was no doubt
a learned and eloquent preacher, still, in his own
time, his son Thomas was for twenty years a Fellow of
the Royal Society, and therefore known also as a not
inconsiderable mathematician. Indeed, his mathematical
contributions to the Philosophical Transactions
show him to have been in the first rank of
independent thinkers, very well qualified to attempt
the really revolutionary task opened out by his
posthumous paper “An Essay towards solving a
problem in the doctrine of chances", which appeared
in the Philosophical Transactions in 1763, not long 
after his death in 1761. 

It is entirely appropriate that this first attempt 
should have been made at this time. For more than 
a century the learned world had been coming to 
regard deliberate experimentation as the funda- 
mental means to “The Improvement of Natural 
Knowledge", in the words chosen by the Royal 
Society. With Isaac Newton, moreover, and such 
men as Robert Boyle, the possibility of formulating 
natural law in quantitative terms had been brilliantly
exhibited. The nature of the reasoning process by 
which appropriate inferences, or conclusions, could 
be drawn from quantitative observational data was 
ripe for consideration. The prime difficulty lay in 
the uncertainty of such inferences, and it was a 
fortunate coincidence that the recognition of the 
concept of probability, and its associated mathe- 
matical laws, in its application to games of chance, 
should at the same time have provided a possible 
means by which such uncertainty could be specified 
and made explicit. In England such a publication 
as Abraham de Moivre's Doctrine of Chances? must 
have been a very immediate stimulus to Bayes' 
reflexions on this subject. 

Bayes' Essay was communicated to the Royal
Society some time after his death by his friend 
Richard Price. Price added various demonstrations 
and illustrations of the method, and seems to have 
replaced Bayes' introduction by a prefatory explanation
of his own. It is to be regretted that we have not
Bayes' own introduction, for it seems clear that 
Bayes had recognized that the postulate used in his 
demonstration would be thought disputable by a
critical reader, and there can be little doubt that this 
was the reason why his treatise was not offered for 
publication in his own lifetime. Price evidently laid 
less weight on these doubts than did Bayes himself; 
on the other hand he very fully appreciated the 
importance of what Bayes had done, or attempted, 
for the advancement of experimental philosophy; 
although the central theorem of the essay is 
framed in somewhat academic and abstract terms, 
without expatiating on the large consequences 
for human reasoning which would flow from its 
acceptance. 


The most important passage of Price's introductory 
letter is as follows (¹ p. 370):


In an introduction which he has writ to this Essay, he says,
that his design at first in thinking on the subject of it was, 
to find out a method by which we might judge concerning 
the probability that an event has to happen, in given 
circumstances, upon supposition that we know nothing con- 
cerning it but that, under the same circumstances, it has 
happened a certain number of times, and failed a certain 
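other number of times. He adds, that he soon perceived
that it would not be very difficult to do this, provided some
rule could be found according to which we ought to estimate
the chance that the probability for the happening of an event
perfectly unknown, should lie between any two named degrees
of probability, antecedently to any experiments made about it;
and that it appeared to him that the rule must be to suppose the
chance the same that it should lie between any two equidistant
degrees; which, if it were allowed, all the rest might be easily
calculated in the common method of proceeding in the doctrine
of chances. Accordingly, I find among his papers a very ingenious
solution of this problem in this way. But he afterwards considered,
that the postulate on which he had argued might
not perhaps be looked upon by all as reasonable; and therefore
he chose to lay down in another form the proposition
in which he thought the solution of the problem is contained,
and in a scholium to subjoin the reasons why he thought so,
rather than to take into his mathematical reasoning anything
that might admit dispute.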


The actual mathematics of Bayes’ theorem may be
expressed very briefly in modern notation.
If in a+b independent trials it has been observed
that there have been a successes and b failures, then
if p were the hypothetical probability of success in
each of these trials, the probability of the happening
of what has been observed would be

$$\frac{(a+b)!}{a!\,b!}\,p^a (1-p)^b; \tag{1}$$

but if in addition we know, or can properly postulate,
that p itself has been chosen by an antecedent
random process, such that the probability of p lying
in any infinitesimal range dp between the limiting
values 0 and 1 is equal simply to

$$dp, \tag{2}$$

then the probability of the compound event of p
lying in the assigned range, and of the observed
numbers of successes and failures occurring will be
the product of these two expressions, namely

$$\frac{(a+b)!}{a!\,b!}\,p^a (1-p)^b\,dp. \tag{3}$$

But, from the data, such a compound event has
happened for some element or other of those into
which the total range from 0 to 1 may be divided, so
that the probability that any particular one should
have happened in fact is the ratio,

$$\frac{\dfrac{(a+b)!}{a!\,b!}\,p^a (1-p)^b\,dp}{\displaystyle\int_0^1 \frac{(a+b)!}{a!\,b!}\,p^a (1-p)^b\,dp} \tag{4}$$

of which the denominator, involving a complete
Eulerian integral, is equal to 1/(a+b+1).
The finite probability that p should lie between any
assigned limits u and v may therefore be expressed
as the incomplete integral

$$\frac{(a+b+1)!}{a!\,b!}\int_u^v p^a (1-p)^b\,dp. \tag{5}$$
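
As a rough numerical check on expressions (1) to (5), the following Python sketch evaluates the probability that p lies between two limits under the uniform postulate (2). The function names, the counts a = 7 and b = 3, and the midpoint-rule integration are illustrative assumptions and form no part of Bayes' essay or of the argument above.

```python
from math import comb


def likelihood(p, a, b):
    """Expression (1): probability of a successes and b failures, given p."""
    return comb(a + b, a) * p**a * (1 - p)**b


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule integral of f over [lo, hi]; a simple numerical stand-in."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def posterior_probability(u, v, a, b):
    """Expression (5): probability that p lies between u and v under postulate (2)."""
    numerator = integrate(lambda p: likelihood(p, a, b), u, v)
    denominator = integrate(lambda p: likelihood(p, a, b), 0.0, 1.0)
    return numerator / denominator


a, b = 7, 3  # hypothetical counts of successes and failures
denominator = integrate(lambda p: likelihood(p, a, b), 0.0, 1.0)
print(denominator, 1 / (a + b + 1))           # the two values should agree closely
print(posterior_probability(0.5, 0.9, a, b))  # probability that 0.5 < p < 0.9
```

The first printed pair compares the numerically integrated denominator of (4) with the closed form 1/(a+b+1).
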
The postulate which Bayes regarded as question- 
able is represented above by the expression (2). The 
greater part of Bayes’ analysis is concerned with 
approximate forms for the discussion of these 
integrals, and is of historical rather than mathe- 
matical interest for the modern reader. How 
explicit Bayes is in introducing his critical postulate 
is shown by his own introductory remarks. 
The crucial theorem, proposition 8, comes in the
second section of the essay, and is preceded by a
special explanatory foreword¹ (p. 385):



Section II 
through the point where it rests a line os shall be drawn
parallel to AD, and meeting CD and AB in s and o; and that
afterwards the ball O shall be thrown p+q or n times, and
that its resting between AD and os after a single throw be 
called the happening of the event M in a single trial. These 
things supposed, 

Lemma 1. The probability that the point o will fall
between any two points in the line AB is the ratio of the 
distance between the two points to the whole line AB. (The 
proof occupies two pages, with an examination of incom- 
mensurability after the manner of the fifth book of Euclid's 
elements.) 

Lemma 2. The ball W having been thrown, and the line
os drawn, the probability of the event M in a single trial is
the ratio Ao to AB.


After a short proof there follow the enunciation of 
proposition 8 and its demonstration. A single figure 
is used to represent the square table and the con- 
struction upon it, and also outside the square, a graph 
representing the function, 


$$\frac{(a+b)!}{a!\,b!}\,p^a q^b, \qquad p+q=1,$$

for all values of p from 0 to 1. The latter is used to
give geometrical significance to the analytic integrals
used above.
In broaching the boundaries of an entirely new
field of thought by means of a single illustrative
theorem, pregnant as it was, Bayes left untouched
many distinctions of importance to its discussion in
the future. In respect to the nature of the concept of
probability very diverse opinions have been expressed.
In particular, although perhaps all would agree that
the word denotes a measure of the strength of an
opinion or state of judgement, some have insisted
that it should properly be used only for the expression
of a state of rational judgement based on sufficient 
objective evidence, while others have thought that 
equality of probability may be asserted merely from 
the indifference of, or the absence of differentiae in, the 
objective evidence, if any, and therefore from the 
total absence of objective evidence, if there were 
none. 

Bayes evidently held the first of these opinions and 
frames a definition suited—in my view—to show (1) 
that he was not thinking merely of games of chance, 
and, (2) at the same time that his concept of probability
was that of the mathematicians, such as
Montmort and de Moivre, who had treated largely of 
gambling problems, in which the equality of pro- 
bability assigned to numerous possible events and 
combinations of events is a consequence of the 
assumed perfection of the apparatus and operations 
employed. 

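[...] p. 376): “5. The probability of any event is the ratio
between the value at which an expectation depending on the
happening of the event ought to be computed, and the value of
the thing expected upon its happening.”

[...] the limiting value of the relative frequency of
success,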


On the contrary Laplace, who needed a definition 
wide enough to be used in the vastly diverse applications
of the Théorie analytique, manifestly inclined to
the second view ? (1820).


The theory of chances consists in reducing all
events of the same kind to a certain number of equally
possible cases, that is to say, such that we are
equally undecided as to their existence; and in determining the
number of cases favourable to the event whose probability
is sought. The ratio of this number to that of all the possible
cases is the measure of this probability, which is thus
only a fraction whose numerator is the number of favourable
cases, and whose denominator is the number of all the
possible cases.


This differs a little from the form used in 1812: 


The theory of probabilities consists in reducing all the
events which can take place in a given circumstance
to a certain number of equally possible cases,
that is to say, such that we are equally undecided as to their
existence, and in determining among these cases the number of those
which are favourable to the event whose probability
is sought. The ratio of this number to that of all the possible
cases is the measure of this probability, which is therefore
only a fraction whose numerator is the number of favourable
cases, and whose denominator is that of all the possible cases.


It is seen that Laplace effectively avoids any 
objective definition, first by using the term possible 
in a context in which probable could be used, without 
explaining what difference, if any, he intends between
the two words, and secondly by his indication that
equal possibility could be judged without cogent
evidence.

In consequence of this difference of concept,
Bayes' attempt is exposed to different types of
criticism in his own hands and in those of Laplace.
While I have, for myself, no doubt that Bayes'
definition is the more satisfactory, being not only in
accordance with the ideas upon which the Doctrine
of Chances of his own time was built, but in connecting
the comparatively modern notion of probability,
which seems to have been unknown to the
Islamic and to the Greek mathematicians, with the
much more ancient notion of an expectation, capable
of being bought, sold and evaluated, nevertheless it
would merely confuse the discussion to give further
reasons for this opinion; the difficulties to which
Bayes' approach was eventually found to lead can be
easily expressed in terms of the notions which he
himself favoured.

Whereas Laplace defined probability by means of
the enumeration of discrete units, Bayes defined a
continuous probability distribution, by a formula for
the probability between any pair of assigned limits.
He did not, however, consider the metric of his continuum.
For in stating his prime postulate in the
form that the chance a priori of the unknown probability
lying between $p_1$ and $p_2$ shall be equal to
$p_2 - p_1$, he might, so far as cogent evidence is concerned,
equally have taken any monotonic function
of p, such as

$$\phi = \tfrac{1}{2}\cos^{-1}(1-2p), \qquad p = \sin^2\phi,$$

and postulated that the chance that $\phi$ should lie
between $\phi_1$ and $\phi_2$ should be

$$\frac{2}{\pi}(\phi_2 - \phi_1),$$

so that, instead of inserting the probability a priori as

$$dp,$$

it would have appeared in the analysis as

$$\frac{1}{\pi}\,\frac{dp}{\sqrt{pq}}, \qquad q = 1-p, \tag{6}$$

a postulate or assumption rather more favourable to
extreme values of the unknown p, near to 0 or 1, at
the expense of more central values.
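
The effect of the change of metric can be illustrated numerically. The sketch below, in Python, compares the prior chance assigned to the same interval of p when p itself is taken as uniformly distributed and when the angle φ is so taken. It uses the equivalent form φ = sin⁻¹(√p), which agrees with φ = ½ cos⁻¹(1 − 2p) on the range 0 to π/2. The function names and the three intervals chosen are illustrative assumptions only.

```python
from math import asin, pi, sqrt


def prior_uniform_in_p(p1, p2):
    """Prior chance that p lies in [p1, p2] when p itself is taken as uniform."""
    return p2 - p1


def prior_uniform_in_phi(p1, p2):
    """Prior chance that p lies in [p1, p2] when phi = asin(sqrt(p)) is uniform on [0, pi/2]."""
    phi1, phi2 = asin(sqrt(p1)), asin(sqrt(p2))
    return (2 / pi) * (phi2 - phi1)


def density_uniform_in_phi(p):
    """Expression (6): the density in p implied by a uniform distribution of phi."""
    q = 1 - p
    return 1 / (pi * sqrt(p * q))


# Intervals of equal width 0.1: one at each extreme and one in the centre.
for lo, hi in [(0.0, 0.1), (0.45, 0.55), (0.9, 1.0)]:
    print((lo, hi), prior_uniform_in_p(lo, hi), prior_uniform_in_phi(lo, hi))
```

Under the second postulate the two extreme intervals receive more than 0.1 each and the central interval less, which is the sense in which (6) favours extreme values.
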

Bayes' introduction of an expression representing 
probability a priori thus contained an arbitrary 
element, and it was doubtless some consciousness of 
this that led to his hesitation in putting his work 
forward. A more important question, however, is 
whether in scientific research, and especially in the 
interpretation of experiments, there is cogent reason 
for inserting a corresponding expression representing 
probabilities a priori. This practical question cannot 
be answered peremptorily, or in general, for certainly 
cases can be found, or constructed, in which valid 
probabilities a priori exist, and can be deduced from 
the data. More frequently, however, and especially 
when the probabilities of contrasted scientific theories 
are in question, a candid examination of the data at 
the disposal of the scientist shows that nothing of the 
kind can be claimed. 


2. George Boole 
The superb pre-eminence of Laplace as a mathe- 
matical analyst undoubtedly inclined mathematicians 
for nearly fifty years to the view that the logical 
approach adopted by him had removed all doubts 
as to the applicability in practice of Bayes' theorem.
That this was indeed Laplace's view may be judged
from his reference to the position of Bayes in the 
history of the subject? (p. cxxxvii). 

Bayes, in the Philosophical Transactions for the year
1763, sought directly the probability that the possibilities
indicated by experiments already made are comprised within
given limits; and he arrived at it in a manner that is fine and
very ingenious, though a little awkward.

I imagine that the hint of criticism in the last 
phrase is directed against Bayes' hesitation to regard 
the postulate he required as axiomatic. It will be 
noticed in the sequel that discussion turned on just 
this question of its axiomatic nature, and not on the 
question, more natural to an experimental investi- 
gator, of whether, in the particular circumstances of 
the investigation, the knowledge implied by the 
postulate was or was not in fact available. It is the 
submission of the author that actual familiarity with 
the processes of scientific research helps greatly in 
the understanding of scientific data, and has in the 
present century clarified the issue by bringing into 
prominence the factual question, rather than the 
abstract question of axiomatic validity. It was not, 
however, until the weight of opinion among philosophical
mathematicians had turned against the
supposed axiom that the controversy could come to
be examined in this more realistic manner.

Simple examples are provided by genetic situations. 
In Mendelian theory there are black mice of two 
genetic kinds. Some, known as homozygotes (BB), 
when mated with brown yield exclusively black off- 
spring; others, known as heterozygotes (Bb), while 
themselves also black, are expected to yield half 
black and half brown. The expectation from a 
mating between two heterozygotes is 1 homozygous
black, to 2 heterozygotes, to 1 brown. A black 
mouse from such a mating has thus, prior to any test- 
mating in which it may be used, a known probability 
of 1/3 of being homozygous, and of 2/3 of being 
heterozygous. If, therefore, on testing with a brown 
mate it yields seven offspring, all being black, we 
have a situation perfectly analogous to that set out 
by Bayes in his proposition, and can develop the 
counterpart of his argument, as follows: 

The prior chance of the mouse being homozygous 
is 1/3; if it is homozygous the probability that the 
young shall be all black is unity; hence the probability
of the compound event of a homozygote producing 
the test litter is the product of the two numbers, or 
1/3. 

Similarly, the prior chance of it being heterozygous 
is 2/3; if heterozygous the probability that the young 
shall be all black is (1/2)^7, or 1/128; hence the probability
of the compound event is the product,
1/192.

But, one of these compound events has occurred; 
hence the probability after testing that the mouse 
tested is homozygous is 


$$\frac{1/3}{1/3 + 1/192} = \frac{64}{65},$$

and the probability that it is heterozygous is

$$\frac{1/192}{1/3 + 1/192} = \frac{1}{65}.$$

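
The arithmetic of this example may be verified with exact fractions. The following Python sketch is offered only as a check on the figures above; the dictionary keys and variable names are illustrative choices, not notation used in the text.

```python
from fractions import Fraction

# Prior probabilities for a black mouse from a mating of two heterozygotes.
prior = {"BB": Fraction(1, 3), "Bb": Fraction(2, 3)}

# Probability of seven black offspring from a brown mate, for each genotype.
litter_size = 7
likelihood = {"BB": Fraction(1), "Bb": Fraction(1, 2) ** litter_size}

# Compound events: genotype together with the observed all-black litter.
joint = {g: prior[g] * likelihood[g] for g in prior}
total = sum(joint.values())

# Probabilities after testing.
posterior = {g: joint[g] / total for g in joint}

print(joint["Bb"])                       # 1/192
print(posterior["BB"], posterior["Bb"])  # 64/65 1/65
```
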
If, therefore, the experimenter knows that the 
animal under test is the offspring of two heterozygotes, 
as would be the case if both parents were known to be 
black, and a parent of each were known to be brown,
or, if, both being black, the parents were known to 
have produced at least one brown offspring, cogent 
knowledge a priori would have been available, and 
the method of Bayes could properly be applied. But,
if knowledge of the origin of the mouse tested were 
lacking, no experimenter would feel he had warrant 
for arguing as if he knew that of which in fact he was 
ignorant, and for lack of adequate data Bayes' 
method of reasoning would be inapplicable to his 
problem. 

It is evidently easier for the practitioner of natural 
science to recognize the difference between knowing 
and not knowing than this seems to be for the more 
abstract mathematician. The traditional line of 
thought running from Laplace to, for example, Sir 
Harold Jeffreys in our own time would be to argue 
that, in the absence of relevant genealogical evidence, 
there being only two possibilities, mutually exclusive, 
and with no prior information favouring one rather 
than the other, it is axiomatic that their probabilities 
a priori are equal, and that Bayes’ argument should 
be applied on this basis. This is to treat the problem, 
in which we have no genealogical evidence, exactly 
as if the mouse to be tested were known to have been 
derived from a mating producing half homozygotes 
and half heterozygotes.

In spite of the high prestige of all that flowed from 
Laplace's pen, and the great ability and industry of 
his expositors, it is yet surprising that the doubts 
which such a process of reasoning from ignorance 
must engender should begin to find explicit expression 
only in the second half of the nineteenth century, 
and then with caution. That extraordinary work, 
The Laws of Thought, by George Boole appeared in
1854.³ Its twentieth chapter is given to problems of
causes, and of this the second half to problems in 
which (p. 320) we “may be required to determine the 
probability of a particular cause, or of some particular 
connection among a system of causes, from observed 
effects”. A hint of Boole's point of view appears in
the opening words of Section 20: 


It is remarkable that the solutions of the previous problems 
are void of any arbitrary element. We should scarcely, from 
the appearance of the data, have anticipated such a circum- 
stance. It is, however, to be observed, that in all those 
problems the probabilities of the causes involved are supposed 
to be known a priori. In the absence of this assumed element 
of knowledge, it seems probable that arbitrary constants 
would necessarily appear in the final solution. 


The cases chosen by Boole to illustrate this view 
are two: (a) The Reverend J. Michell had calculated
that if stars of each magnitude were dispersed at 
random over the celestial sphere, there would very 
rarely occur so many apparent double stars or 
clusters as those actually observed by astronomers. 
(b) The planes of revolution of the planets of the solar 
system are more nearly coincident than could often 
occur if these planes had been assigned at random. 
Instead of treating these calculations, as would now 
generally be done, as tests of significance overthrowing 
the theory of random dispersal, and therefore all 
cosmological theories implying random dispersal 
(disposing of this hypothesis without reference to, or 
consideration of, any alternative hypothesis which 
might be actually or conceivably brought forward),
instead of this, it had been thought proper to discuss
each question as one in inverse probability, and
Boole has no difficulty in showing that as such it
requires two elements really unknown, namely the 
probability a priori of random dispersal, and, 
secondly, the probability, in the aggregate of alter- 
native hypotheses, of the observed frequency of 
conjunctions being realized. As he says on p. 367, 
“Any solutions which profess to accomplish this 
object, either are erroneous in principle, or involve a 
tacit assumption respecting the above arbitrary 
elements.” 
Again in Section 22: 


Are we, however, justified in assigning to [these two 
unknowns] particular values? I am strongly disposed to 
think that we are not. The question is of less importance in 
the special instance than in its ulterior bearings. In the 
received applications of the theory of probabilities, arbitrary 
constants do not explicitly appear; but in the above, and in 
many other instances sanctioned by the highest authorities, 
some virtual determination of them has been attempted. 
And this circumstance has given to the results of the theory, 
especially in reference to questions of causation, a character 
of definite precision, which, while on the one hand it has 
seemed to exalt the dominion and extend the province of
numbers, even beyond the measure of their ancient claim to 
rule the world; on the other hand has called forth vigorous 
protests against their intrusion into realms in which con- 
jecture is the only basis of inference. The very fact of the 
appearance of arbitrary constants in the solutions of problems 
like the above, treated by the method of this Work, seems to 
imply, that definite solution is impossible, and to mark the 
point where inquiry ought to stop. 


On page 370: 


It has been said, that the principle involved in the above 
and in similar applications is that of the equal distribution of 
our knowledge, or rather of our ignorance—the assigning to 
different states of things of which we know nothing, and upon
the very ground that we know nothing, equal degrees of 
probability. I apprehend, however, that this is an arbitrary 
method of procedure.


And finally, on page 375: 


These results only illustrate the fact, that when the defect 
of data is supplied by hypothesis, the solutions will, in 
general, vary with the nature of the hypotheses assumed; so 
that the question still remains, only more definite in form, 
whether the principles of the theory of probabilities serve 
to guide us in the election of such hypotheses. I have already
expressed my conviction that they do not—a conviction 
strengthened by other reasons than those above stated... 
Still it is with diffidence that I express my dissent on these 
points from mathematicians generally, and more especially 
from one who, of English writers, has most fully entered into 
the spirit and the methods of Laplace; and I venture to 
hope, that a question, second to none other in the Theory 
of Probabilities in importance, will receive the careful 
attention which it deserves. 


These quotations, which I have picked out from 
the rather lengthy mathematical examples which 
Boole developed, are sufficient to exhibit unmistak- 
ably his logical point of view. He does not, indeed, 
go so far as to say that no statements in terms of 
mathematical probability can properly be based on 
data of the kind considered, but he is entirely clear 
in rejecting the application to these cases of the 
method of arriving at such statements which, in the 
absence of appropriate data, introduced values of the 
probabilities a priori supported only by a questionable
axiom.

His phrase, moreover, on supplying by hypothesis
what is lacking in the data, points to an
abuse very congenial to certain twentieth-century
writers.


3. John Venn and the Rule of Succession 


An immediate inference from Bayes' theorem 
assigning the frequency distribution 


$$\frac{(a+b+1)!}{a!\,b!}\,p^{a}(1-p)^{b}\,dp \qquad (7)$$


to the probability of success of an event, supposed 
constant, after a successes have been observed in 
a+b independent trials, is to calculate the probability 
of success of a new trial, of the same kind as the 
others and, like them, independent. For this we 
have only to multiply the frequency element above 
by p and integrate between the limits 0 and 1. The
result takes the simple form 


(a+1)/(a+b+2), (8) 


and this inference came to be known as the Rule of 
Succession; it is often quoted in the form taken 
when b=0, as leading to the probability (a+1)/(a+2).
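The integration leading to (8) may be verified exactly for any small whole numbers a and b. The sketch below is an illustration only; the helper names `beta_integral` and `rule_of_succession` are hypothetical, and the check relies on the standard result that the integral of p^x (1-p)^y from 0 to 1 is x! y!/(x+y+1)! for non-negative integers x and y.

```python
from fractions import Fraction
from math import factorial


def beta_integral(x, y):
    """Exact integral of p**x * (1 - p)**y over [0, 1], x and y non-negative integers."""
    return Fraction(factorial(x) * factorial(y), factorial(x + y + 1))


def rule_of_succession(a, b):
    """Multiply the frequency element (7) by p and integrate from 0 to 1."""
    normalizer = Fraction(factorial(a + b + 1), factorial(a) * factorial(b))
    return normalizer * beta_integral(a + 1, b)


# Three successes and no failures, the case used in Venn's examples.
print(rule_of_succession(3, 0))
```

The function reproduces (a+1)/(a+b+2) only because the frequency element (7) has been assumed; it says nothing on whether that assumption is warranted in a given case.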

It should be emphasized, as it has sometimes 
passed unnoticed, that such a rule can be based on 
Bayes’ theorem only on certain conditions. It
requires that (i) the record of a successes out of a+b 
trials constitutes the whole of the information avail- 
able; (ii) the successive trials are independent in the 
sense that the success or failure of one trial has no 
effect in favouring the success or failure of subsequent 
tests, which have in each case the same probabilities. 

During the long period over which its correctness 
was unquestioned, the Rule of Succession had been
eagerly seized upon by logicians as providing a solid
mathematical basis for inductive reasoning. In his 
Logic of Chance, Venn, who was developing the con- 
cept of probability as an objective fact, verifiable by 
observations of frequency, devotes a chapter to 
demolishing the Rule of Succession, and this, from a 
writer of his weight and dignity, had an undoubted 
effect in shaking the confidence of mathematicians in 
its mathematical foundation. 

Venn, however, does not discuss its foundation, 
and perhaps was not aware that it had a mathematical 
basis demonstrated by Laplace; that like other 
mathematical theorems it contained stipulations 
specific for its validity; and that in particular it 
rested upon the supposed, though disputable, axiom 
used for the demonstration of Bayes' proposition. 
As in other cases in which a work of demolition is 
undertaken with great confidence, there is no doubt 
that Venn in this chapter uses arguments of a quality 
which he would scarcely have employed had he 
regarded the matter as one open to rational debate. 

After giving instances of not very unreasonable 
inferences drawn by Laplace and De Morgan with the 
aid of the Rule, Venn writes (p. 180):


Let us add an example or two more of our own. I have 
Observed it rain three days successively,—I have found on 
three separate occasions that to give my fowls strychnine 
has caused their death, —I have given a false alarm of fire on 
three different occasions and found the people come to help
me each time.


These examples seem to be little more than 
rhetorical sallies intended to overwhelm an opponent 
with ridicule. They scarcely attempt to conform
with the conditions of Bayes' theorem, or of the rule
of succession based upon it. In the last case the 
reader is to presume, the same neighbours having 
been deceived on three occasions, that on the fourth 
they will be for this reason less ready to exert them- 
selves; that is to say, the successive trials are not 
even conceived to be independent. Objection could 
be made on the same ground to the first example, 
which is perhaps particularly unrealistic in that three 
rainy days are postulated to comprise the whole of 
the subject's experience of days wet or fine. Perhaps 
the example could be repaired by making him arrive 
by air in a region of unknown climate; if so, Bayes’ 
postulate implies that the region has been chosen 
at random from an aggregate of regions in each 
of which the probability of rain is constant and 
independent from day to day, while this probability 
varies from region to region in an equal distribution 
from 0 to 1. If applied to cases in which this information
is lacking the inference is not indeed ridiculous,
though I should agree with Venn that it would often 
be found to be mistaken, if put to the test of repeated 
trials, and it can scarcely be doubted that Bayes 
would have taken the same view. A climate without 
the unnatural feature of independence of weather 
from day to day, and therefore without conforming 
to the conditions of Bayes' theorem, might yet 
justify the Rule of Succession, in the limited form here 
used, if the proportion of all rainy days falling in 
spells of n successive rainy days were 


$$\frac{4n}{(n+1)(n+2)(n+3)}, \qquad (9)$$


and it is clearly a question of ascertainable fact, and 
not of personal predilection, whether the climate of
any part of the world conforms to such a rule.* 
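That a distribution of spells of this form yields the Rule in its limited form, (k+1)/(k+2) after k wet days, can be checked numerically. The sketch below assumes that the share of rainy days lying in spells of exactly n days is 4n/((n+1)(n+2)(n+3)), and that the observer arrives on a day taken at random among rainy days, so that within a spell of m days each of the m positions is equally likely. The function names and the truncation point `max_spell` are illustrative choices, not part of the argument.

```python
def rainy_day_share(n):
    """Share of all rainy days lying in spells of exactly n rainy days."""
    return 4.0 * n / ((n + 1) * (n + 2) * (n + 3))


def prob_run_of_rain(k, max_spell=100_000):
    """Chance that a randomly chosen rainy day begins a run of at least k rainy days.

    In a spell of m days, m - k + 1 of the m positions leave room for such a run.
    The infinite sum is truncated at max_spell, so the result is approximate.
    """
    return sum(
        rainy_day_share(m) * (m - k + 1) / m for m in range(k, max_spell + 1)
    )


def next_day_rainy(k):
    """Chance of rain on the next day, given k rainy days observed since arrival."""
    return prob_run_of_rain(k + 1) / prob_run_of_rain(k)


for k in (1, 2, 3):
    print(k, next_day_rainy(k), (k + 1) / (k + 2))
```

The two printed columns should agree to within the truncation error, although no independence of weather from day to day has been assumed in the calculation.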

* While this simple distribution suffices to justify the Rule of
Succession when applied to experience of only wet or of only fine days,
the general form of the rule requires that the spells of wet and fine
weather must be arranged to fulfil further conditions. Thus the
frequency with which two successive spells are of u and v days
respectively must be

$$\frac{6\,(u+1)!\,(v+1)!}{(u+v+3)!}, \qquad (10)$$

consistent with the marginal frequency for u and v

$$\frac{12\,(u!)}{(u+3)!}. \qquad (11)$$

Three successive spells of lengths u, v and w, in that order, have
frequency

$$\frac{6\,(v+2)!\,(u+w)!}{(u+v+w+3)!}, \qquad (12)$$

while four spells of lengths t, u, v, w will have a frequency

$$\frac{6\,(t+v+1)!\,(u+w+1)!}{(t+u+v+w+3)!}. \qquad (13)$$

The standardization of drugs by an experimental
assay of their potency often involves the determination
of the 50% lethal dose, or, that which, with the
population of animals sampled for testing, will have
a probability of 50% of killing in each case. The
rhetorical force of Venn's example lies in the presumption
that much more than the 50% lethal dose
of strychnine was employed, but a valid criticism
of Bayes' theorem through the failure of the Rule of
Succession requires a less cavalier treatment of the
example. If, for example, 50 strengths of dose were
made up at concentrations capable of killing 1%,
3%, ..., 99% of the animals to be tested, and if each
experiment consisted in choosing one of these doses,
with equal probability, and applying this dose to
each of four hens chosen at random; of ascertaining
first if each of three of these hens had died, and if so
of predicting that the fourth would die also, the proportion
of successes would agree closely with the
fraction 4/5 given by the rule of succession. If, on
the contrary, the doses were spaced in toxic content
between the 1% and the 99% dosages, either in
arithmetical or in geometrical progression, the forecast,
though by no means a ridiculous estimate,
would doubtless be somewhat in error. Knowledge of
the experimental conditions might thus justify the
rule, though it cannot rationally be based on ignorance
of them.
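The agreement claimed for the first arrangement of doses can be computed directly. In the sketch below the 50 doses are taken to kill 1%, 3%, ..., 99% of the animals, each dose being chosen with equal probability and the four hens responding independently; the proportion of successful forecasts is then the chance that all four die divided by the chance that the first three die. The function name `forecast_success_rate` is illustrative, and no dose-response curve is assumed, so the second arrangement (equal steps of toxic content) is not covered.

```python
from fractions import Fraction


def forecast_success_rate(kill_probs):
    """Among experiments in which three hens die, the share in which the fourth dies too.

    Each entry of kill_probs is the kill probability of one equally likely dose.
    """
    three_die = sum(p ** 3 for p in kill_probs)
    four_die = sum(p ** 4 for p in kill_probs)
    return four_die / three_die


# Doses killing 1%, 3%, ..., 99% of the animals tested.
equal_steps = [Fraction(2 * k - 1, 100) for k in range(1, 51)]
rate = forecast_success_rate(equal_steps)
print(float(rate))
```

The result lies very near 4/5 because the 50 equally likely kill probabilities closely approximate an equal distribution from 0 to 1, which is exactly the condition the rule requires.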

It seems that in this chapter Venn was to such an 
extent carried away by his confidence that the rule 
of induction he was criticizing was indefensible in 
many of its seeming applications, and by his eagerness 
to dispose of it finally, that he became uncritical of 
the quality of the arguments he used. A most 
serious lapse of a general character appears on page 
181 at the beginning of Section 9: 


It is surely a mere evasion of the difficulty to assert, as is 
sometimes done, that the rule is to be employed in those cases 
only in which we do not know anything beforehand about the 
mode and frequency of occurrence of the events. The truth 
or falsity of the rule cannot be in any way dependent upon 
the ignorance of the man who uses it. His ignorance affects 
himself only, and corresponds to no distinction in the things. 


Taken in its sweeping generality such an argument
seems to imply that the extent of the observational 
data available can have no bearing on the nature or 
precision of our inferences from them; that a jury 
ignorant of certain facts ought to give the same 
verdict as one to whom they have been presented! 
The precise specification of our knowledge is, however,
the same as the precise specification of our
ignorance. Certainly, the observer's knowledge or 
ignorance may have no effect on external objects, but 
the extent of the observations to which his reasoning 
is applied does make a selection of those material 
systems to which he imagines his conclusions to be 
applicable; and the objective frequencies observed 
in such selected systems may depend, and indeed 
must depend if inductive reasoning have any validity, 
on the observational basis by which the selection is 
effected, and on which the reasoning is based. 

It is certain that Venn understood this in respect 
of the inductive process generally, and that nothing 
but inadvertence can have led him to develop, in 
criticizing “the rule", a mode of argument fatal 
equally to all inferences based on experience. 

Perhaps the most important result of Venn's 
criticism was the departure made by Professor 
G. Chrystal in eliminating from his celebrated 
textbook of Algebra the whole of the traditional
material usually presented under the headings of 
Inverse Probability and of the Theory of Evidence. 
Chrystal does not discuss the objections to this 
material, but expresses the opinion that "many of 
the criticisms of Mr. Venn on this part of the doctrine 
of chances are unanswerable. The mildest judge- 
ment we could pronounce would be the following 
words of De Morgan himself, who seems, after all, to 
have ‘doubted’: ‘My own impression derived from this 
and many other circumstances connected with the ana- 
lysis of probabilities, is, that the mathematical results 
have outrun their interpretation.’" (Chapter xxvi.)

It should be noted that De Morgan's remark has
been quoted clean out of its context; that he was not
writing on inverse probability, nor even on the theory
of evidence, but about the curiously ubiquitous 
success of methods based on the Normal law of errors, 
even when applied to cases in which such a law is not 
accurately plausible. In fact the passage (from the 
Fourth Appendix, the title of which is “On the 
average result of a number of observations") goes on: 


and that some simple explanation of the force and meaning 
of the celebrated integral, whose values are tabulated at the 
end of this work, will one day be found to connect the higher 
and lower parts of the subject with a degree of simplicity 
which will at once render useless (except to the historian) all 
the works hitherto written. 


In reality, the introduction of the inverse method 
was to De Morgan (p. vi) one of the most important
advances to be recorded in the history of the theory
of probability. I have already quoted his opinion to
this effect, in the introduction to my book on
The Design of Experiments⁵:


There was also another circumstance which stood in the 
way of the first investigators, namely, the not having con- 
sidered, or, at least, not having discovered the method of 
reasoning from the happening of an event to the probability 
of one or another cause. The questions treated in the third 
chapter of this work could not therefore be attempted by
them. Given an hypothesis presenting the necessity of one
or another out of a certain [...] could not infer the
probability [...] the event should cause [...]
But, just as in natural [...] hypothesis
by means of observed facts is always preliminary to any
attempt at deductive discovery; so in the application of 
the notion of probability to the actual affairs of life, the 
process of reasoning from observed events to their most 
probable antecedents must go before the direct use of any 
such antecedent, cause, hypothesis, or whatever it may 
be correctly termed. These two obstacles, therefore, the 
mathematical difficulty, and the want of an inverse method, 
prevented the science from extending its views beyond 
problems of that simple nature which games of chance
present.


If he ever checked the reference to his quotation, 
therefore, Chrystal was scarcely playing fair. His 
case as well as Venn's illustrates the truth that the 
best causes tend to attract to their support the worst 
arguments, which seems to be equally true in the 
intellectual and in the moral sense. 


4. The meaning of probability 

Whatever view may be preferred on the contro- 
versial issues which the quotations set out above have 
been selected to illustrate, it is evident beyond 
question that highly competent, or even illustrious, 
mathematicians had formed upon them quite irrecon- 
cilable opinions; and this appearance of inability to 
find a common ground is not lessened by a perusal of 
what has been written in our own century. 

Since there is no reason to doubt the purely 
mathematical ability of these writers, it is natural to 
suspect a semantic difficulty due to an imperfect 
analysis of words regarded as being too simple to be 
elucidated by further examination, such as the word 
"probability" itself. Of course, each writer has 
“defined” this word to his own satisfaction. Mathematical
definition is, however, often no more than a
succinct statement of the axioms to be applied when 
the word occurs in deductive mathematical reasoning, 
and may pay less attention than is needed to the 
conditions of the correct applicability of the term in 
the real world. It is these conditions of applicability 
which are properly the concern of those responsible 
for Applied Mathematics. 

Indeed, I believe that a rather simple semantic 
confusion may be indicated as relevant to the issues 
discussed, as soon as consideration is given to the 
meaning that the word probability must have to 
anyone so much practically interested as is a gambler, 
who, for example, stands to gain or lose money, in the 
event of an ace being thrown with a single die. To 
such a man the information supplied by a familiar 
mathematical statement such as: “If a aces are
thrown in n trials, the probability that the difference
in absolute value between a/n and 1/6 shall exceed
any positive value ε, however small, shall tend
to zero as the number n is increased indefinitely”,
will seem not merely remote, but also incomplete 
and lacking in definiteness in its application to the 
particular throw in which he is interested. Indeed, 
by itself it says nothing about that throw. It is 
obvious, moreover, that many subsets of future
throws, which may include his own, can be shown to
give probabilities, in this sense, either greater or less
than 1/6. Before the limiting ratio of the whole set
can be accepted as applicable to a particular throw, a
second condition must be satisfied, namely that
before the die is cast no such subset can be recognized.
This is a necessary and sufficient condition for the
applicability of the limiting ratio of the entire
aggregate of possible future throws as the probability
of any one particular throw. On this condition we 
may think of a particular throw, or of a succession of 
throws, as a random sample from the aggregate, which 
is in this sense subjectively homogeneous and without 
recognizable stratification. 

Makers of the standard apparatus of games of 
chance, dice, cards, roulettes, etc., take great care to 
satisfy both the requirements of a sufficiently specific 
statement of what is meant by probability. If either 
the long-run frequencies were faulty, or, in particular, 
if there were any means of foreseeing, even to a 
limited extent, the outcome of their use in a particular
case, the apparatus, or, perhaps, the method of 
using them, would be judged defective for the purpose 
for which they were made. 

This fundamental requirement for the applicability 
to individual cases of the concept of classical probability
shows clearly the role of subjective ignorance,
as well as that of objective knowledge in a typical 
probability statement. It has been often recognized 
that any probability statement, being a rigorous 
statement involving uncertainty, has less factual 
content than an assertion of certain fact would have, 
and at the same time has more factual content than a 
statement of complete ignorance. The knowledge 
required for such a statement refers to a well-defined 
aggregate, or population of possibilities within which 
the limiting frequency ratio must be exactly known. 
The necessary ignorance is specified by our inability 
to discriminate any of the different sub-aggregates 
having different limiting frequency ratios, such as must
always exist. Laplace's definition of probability, in
which he actually speaks of "événemens", is so
worded that this necessary stipulation of ignorance
in respect of particular events can be transferred to 
hypotheses, so as to imply in Boole's words “the assign- 
ing to different states of things of which we know 
nothing, and upon the very ground that we know 
nothing, equal degrees of probability". Has Laplace 
not in fact passed unawares from proposition (a) 
below to proposition (b)? 

(a) A possible outcome must be assigned equal 
probabilities in different future throws, because we can
draw no relevant distinction between them in advance. 

(b) Hypotheses must be judged equally probable 
a priori if no relevant distinction can be drawn 
between them. 

How extremely conservative was the tradition of 
mathematical teaching is shown by the slowness with 
which the opinions of Boole, Venn and Chrystal were 
appreciated. The reluctance naturally felt to 
abandoning a false start was certainly enhanced by 
the fact that, so far as the problem of scientific 
induction was concerned, nothing had been put for- 
ward to replace that which had been taken away. The 
gap seems to have been felt only subconsciously. In 
many cases it must have been clear that it was 
possible for data of great value to the formation of 
our scientific ideas to be presented, and yet for there 
to be no defensible basis, in the light of the criticisms 
which had been made, for the application of Bayes’ 
theorem. Many mathematicians must have felt that 
with a proper restatement, the theorem, or one ful- 
filling the same purpose in inductive reasoning, could 
be set on its feet again. Indeed, the two leading 
statisticians in England at the beginning of the 
twentieth century, K. Pearson (1920) and F. Y. 
Edgeworth (1908) (p. 387) both put forward attempts,
discordant indeed and both abortive, to
justify the mode of reasoning in which no doubt each 
had been brought up. 

The reader of this preliminary chapter will have 
seized my meaning if he perceives that the different 
situations in which uncertain inferences may be 
attempted admit of logical distinctions which should 
guide our procedure. That it may be that the data 
are such as to allow us to apply Bayes' theorem; or, 
secondly, that we may be able validly to apply a test of 
significance to discredit a hypothesis the expectations 
from which are widely at variance with ascertained 
fact. If we use the term rejection for our attitude to 
such a hypothesis, it should be clearly understood 
that no irreversible decision has been taken; that, as 
rational beings, we are prepared to be convinced by 
future evidence that appearances were deceptive, and 
that in fact a very remarkable and exceptional 
coincidence had taken place. Such a test of signifi- 
cance does not authorize us to make any statement 
about the hypothesis in question in terms of mathe- 
matical probability, while, none the less, it does 
afford direct guidance as to what elements we may 
reasonably incorporate in any theories we may be 
attempting to form in explanation of objectively 
observable phenomena. Thirdly, the logical situation 
we are confronted with may admit of the consideration
of a series, or, more usually, of a continuum of hypo- 
theses, one of which must be true, and among which 
a selection may be made, and that selection justified, 
so far as may be, by statistical reasoning. The 
stasis or deadlock which had set in by the turn of the 
century has been, I shall hope to show, in fact, 
released by the consideration of these diverse
possibilities.


CHAPTER III


FORMS OF QUANTITATIVE INFERENCE 


1. The simple test of significance 


While, as Bayes perceived, the concept of Mathe- 
matical Probability affords a means, in some cases, 
of expressing inferences from observational data, 
involving a degree of uncertainty, and of expressing 
them rigorously, in that the nature and degree of the 
uncertainty is specified with exactitude, yet it is by 
no means axiomatic that the appropriate inferences, 
though in all cases involving uncertainty, should 
always be rigorously expressible in terms of this 
same concept. Although this belief seems to have 
been unquestioned over the period of 150 years 
covered by the discussion of Chapter II, familiarity 
with the actual use made of statistical methods in the 
experimental sciences shows that in the vast majority 
of cases the work is completed without any statement 
of mathematical probability being made about the 
hypothesis or hypotheses under consideration. The 
simple rejection of a hypothesis, at an assigned level 
of significance, is of this kind, and is often all that is 
needed, and all that is proper, for the consideration of 
a hypothesis in relation to the body of experimental 
data available. It is therefore desirable to examine 
the logical nature of this sort of uncertain inferences. 

The example chosen by Boole of Michell's calcula- 
tion with respect to the Pleiades will serve as an 
illustration. He demonstrates that Bayes’ method 
can be applied to this case only by assuming arbitrary 
values not provided by the data, and therefore that
no probability a posteriori can be assigned to the 
hypothesis that the stars, down to the sixth magni- 
tude, are distributed at random over the celestial 
sphere. He does not emphasize that nevertheless 
Michell had by his calculations presented a strong 
reason for rejecting this hypothesis, or attempt to 
exhibit just how such a rational inference should be 
correctly stated. 

Michell supposed that there were in all 1500 stars
of the required magnitude and sought to calculate the 
probability, on the hypothesis that they are individ- 
ually distributed at random, that any one of them 
should have five neighbours within a distance of a 
minutes of arc from it. I find the details of Michell's 
calculation obscure, and suggest the following argu- 
ment. 

The fraction of the celestial sphere within a circle of
radius a minutes is, to a satisfactory approximation,

$$p = \left(\frac{a}{6875.5}\right)^{2}, \qquad (14)$$

in which the denominator of the fraction within
brackets is the number of minutes in two radians.
So, if a is 49, the number of minutes from Maia to its
fifth nearest neighbour, Atlas, we have

$$p = \left(\frac{1}{140.316}\right)^{2} = \frac{1}{19689}. \qquad (15)$$

Out of 1499 stars other than Maia of the requisite
magnitude the expected number within this distance
is therefore

$$\frac{1499}{19689} = \frac{1}{13.1345} = 0.07613. \qquad (16)$$

The frequency with which 5 stars should fall
within the prescribed area is then given approximately
by the term of the Poisson series 


$$e^{-m}\,\frac{m^{5}}{5!}, \quad m = 0.07613, \qquad (17)$$

or, about 1 in 50,000,000, the probabilities of
having 6 or more close neighbours adding very 
little to this frequency. Since 1500 stars have each 
this probability of being the centre of such a close 
cluster of 6, although these probabilities are not 
strictly independent, the probability that among 
them any one fulfils the condition cannot be far 
from 30 in a million, or 1 in 33,000. Michell arrived
at a chance of only 1 in 500,000, but the higher
probability obtained by the calculations indicated 
above is amply low enough to exclude at a high level 
of significance any theory involving a random 
distribution. 
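The successive steps of this calculation may be retraced numerically. The sketch below is offered only as a check on the arithmetic; the function names are hypothetical, the small-circle approximation of the text is used for the fraction of the sphere, and the 1500 probabilities are simply added, although, as noted above, they are not strictly independent.

```python
import math

# Number of minutes of arc in two radians, about 6875.5.
ARC_MINUTES_IN_TWO_RADIANS = 2 * math.degrees(1) * 60


def cap_fraction(radius_arcmin):
    """Approximate fraction of the celestial sphere within a small circle."""
    return (radius_arcmin / ARC_MINUTES_IN_TWO_RADIANS) ** 2


def poisson_term(k, mean):
    """Term of the Poisson series for exactly k occurrences."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


p = cap_fraction(49)                   # Maia to Atlas, 49 minutes of arc
expected = 1499 * p                    # expected neighbours among the other stars
p_five = poisson_term(5, expected)     # one given star with 5 close neighbours
p_any = 1500 * p_five                  # any of the 1500 stars, probabilities added

print(1 / p, expected, 1 / p_five, 1 / p_any)
```

The last two printed figures correspond to the odds of about 1 in 50,000,000 and 1 in 33,000 quoted in the text.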

The force with which such a conclusion is supported 
is logically that of the simple disjunction: Either an 
exceptionally rare chance has occurred, or the theory 
of random distribution is not true. 

In view of the efforts which have been made to 
force a frequency interpretation on to such a dis- 
junction, it is to be noted that the mental reluctance 
to accept an event intrinsically improbable would 
still be felt if, for example, a datum were added to 
Michell's problem to the effect that it was a million 
to one a priori that the stars should be scattered at 
random. We need not consider what such a state- 
ment of probability a priori could possibly mean in 
our astronomical problem; all that is needed is that 
if this datum were introduced into the calculation, 
then, in view of the observations, a probability
statement could be inferred a posteriori, to the effect
that the odds were 30 to 1 that the stars really had 
been scattered at random. The inherent improb- 
ability of what has been observed being observable 
on this view still remains in our minds, and no 
explanation has been given of it. It has been over- 
weighted, not neutralized, by the even greater 
supposed improbability of the universe chosen for 
examination being of the supposedly exceptional 
kind in which the stars are not distributed at random. 
The observer is thus not left at all in the same state 
of mind as if the stars had actually displayed no 
evidence against a random arrangement, although he 
would have been forced logically to admit that (so 
far as statements in terms of probability went) such 
a theory was probably true, and that the remarkable 
features that had attracted his attention were, 
incredible as it might seem, wholly fortuitous. 
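The odds of 30 to 1 can be retraced by a short calculation. It is sketched here under an assumption that the text does not state explicitly: that the observed clustering is treated as certain to appear when the stars are not scattered at random, so that the likelihood ratio in favour of randomness is simply the probability of about 1 in 33,000 found above.

```latex
\[
\frac{\Pr(\text{random}\mid\text{data})}{\Pr(\text{not random}\mid\text{data})}
= \underbrace{\frac{10^{6}}{1}}_{\text{odds a priori}}
  \times
  \underbrace{\frac{1/33{,}000}{1}}_{\text{likelihood ratio}}
\approx \frac{30}{1}.
\]
```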

The example shows that the resistance felt by the 
normal mind to accepting a story intrinsically too 
improbable is not capable of finding expression in 
any calculation of probability a posteriori. The 
variety of ways in which this resistance does express 
itself very well exhibits its reality. Common 
reactions are: 

(a) The whole thing is a fabrication. 

(b) There is no sufficient reason to think that the 
facts were observed and put on record accurately. 

(c) There has been exaggeration, and the omission 
of circumstances that would help to explain what is 
claimed. 

(d) Some occult cause, beyond our present under- 
standing, must be invoked. 


In the studies known as parapsychology enormous
odds are often claimed, evidently with a view to
raising the resistance felt to accepting what is 
intrinsically improbable to such a pitch that con- 
clusion (d), although itself repugnant, shall be 
accepted in preference. The incredulous, however, 
tend to prefer explanations of types (a), (b) or (c) 
either to accepting such a claim as, let us say,
"precognition", or, what seems almost always to be
the last choice, to the acceptance as genuine of a 
very rare contingency. 

The fact, important for the understanding of 
logical situations of this kind, that reluctance to 
accept a hypothesis strongly contradicted by a test 
of significance is not removed, though it may be 
outweighed, by information a priori, is exhibited also
by the consideration that if the proposed datum,
"The odds are a million to one a priori that the stars
should really be distributed singly and at random",
if this datum were considered as a hypothesis, it 
would be rejected at once by the observations at a 
level of significance almost as great as the hypothesis, 
"The stars are really distributed at random", was 
rejected in the first instance. Were such a conflict of 
evidence, as has here been imagined under discussion, 
not in a mathematical department, but in a scientific 
laboratory, it would, I suggest, be some prior 
assumption, corresponding to an axiom or a datum
in a mathematical argument, that would certainly 
be impugned. 

The attempts that have been made to explain the
cogency of tests of significance in scientific research,
by reference to hypothetical frequencies of possible
statements, based on them, being right or wrong, thus
seem to miss the essential nature of such tests. A man
who "rejects" a hypothesis provisionally, as a matter
of habitual practice, when the significance is at the
1% level or higher, will certainly be mistaken in not
more than 1% of such decisions. For when the 
hypothesis is correct he will be mistaken in just 1% of 
these cases, and when it is incorrect he will never be 
mistaken in rejection. This inequality statement 
can therefore be made. However, the calculation is 
absurdly academic, for in fact no scientific worker 
has a fixed level of significance at which from year to 
year, and in all circumstances, he rejects hypotheses; 
he rather gives his mind to each particular case in the 
light of his evidence and his ideas. Further, the 
calculation is based solely on a hypothesis, which, in 
the light of the evidence, is often not believed to be 
true at all, so that the actual probability of erroneous 
decision, supposing such a phrase to have any 
meaning, may be much less than the frequency 
specifying the level of significance. To a practical 
man, also, who rejects a hypothesis, it is, of course, a
matter of indifference with what probability he might
be led to accept the hypothesis falsely, for in his case
he is not accepting it.

On the whole the ideas (a) that a test of significance
must be regarded as one of a series of similar tests
applied to a succession of similar bodies of data, and
(b) that the purpose of the test is to discriminate or
"decide" between two or more hypotheses, have
greatly obscured their understanding, when taken not
as contingent possibilities but as elements essential
to their logic. The appreciation [...] nature of a test
of significance applied to a single
hypothesis by a unique body of observations.

Though recognizable as a psychological condition
of reluctance, or resistance to the acceptance of a
proposition, the feeling induced by a test of significance
has an objective basis in that the probability
statement on which it is based is a fact communicable 
to, and verifiable by, other rational minds. The 
level of significance in such cases fulfils the conditions 
of a measure of the rational grounds for the disbelief 
it engenders. It is more primitive, or elemental 
than, and does not justify, any exact probability 
statement about the proposition. 

When a prediction is made, having a known low 
degree of probability, such as that a particular throw 
with four dice shall show four sixes, an event known 
to have a mathematical probability, in the strict 
sense, of 1 in 1296, the same reluctance will be felt
towards accepting this assertion, and for just the
same reason, indeed, that a similar reluctance is
shown to accepting a hypothesis rejected at this
level of significance. There is the logical disjunction:
Either an intrinsically improbable event will occur, 
or, the prediction will not be verified. The psycho- 
logical resistance has been, I think wrongly, ascribed 
to the fact that the event in question has, in the
proper sense of the Theory of Probability, the low
probability assigned to it, rather than to the fact,
very near in this case, that the correctness of the
assertion would entail an event of this low probability. 
The probability statement is a sufficient, but not a 
necessary, condition for disbelief in this degree. The 
difficulty of traditional forms of expression, in this as
in other cases, flows from the assumption, too widely
disseminated, that Mathematical Probability, being
a well-defined concept, and for a long while the
only one available, for the expression of statements
of uncertainty, must necessarily be by itself competent
for the adequate specification of uncertainty,
that is of the grounds for belief or disbelief, in all
logical situations. The logical consequences of a statement
of Mathematical Probability are clear and well-known.
They allow of the calculation of long-run
frequency-ratios, and therefore to the habitual
gambler, of long-[...] hypothetical and [...]
played fairly [...] the bearing of observable facts
on [...] of possible hypotheses.

In general tests of significance are based on hypothetical
probabilities calculated from their null
hypotheses. They do not generally lead to any
probability statements about the real world, but to a
rational and well-defined measure of reluctance to
the acceptance of the hypotheses they test.

Closely related to the assumption that all expressions
of uncertain knowledge must have the same
logical form, namely that of a statement of probability,
is the assumption that all kinds of evidence
used as data for such inferences have the same kind
of logical consequences. Even a writer so shrewd as
J. M. Keynes (1921) has exposed himself to this
criticism. In an illuminating passage he objected to
the traditional way of speaking, which he ascribed to
[...] and his followers, of a probability as
unknown: "Do we mean unknown through lack
of skill in arguing from given evidence, or unknown
through lack of evidence? The first is alone admissible,
for new evidence would give us a new
probability, not a fuller knowledge of the old
one" (p. 31).

In this he must not, as might at first be thought, be 
taken to assert that in all cases a probability state- 
ment can be made, for he says later on the same page 

"We ought to say that in the case of some arguments 
a relation of probability does not exist, and not that 
it is unknown". This admission greatly reduces the 
force of his objection, for if, at one stage, a prob- 
ability does not exist, but with further observations 
it becomes possible to assert a definite probability, it 
is not a misuse of words to say that it was at first
unknown, and has later been ascertained. However,
for the cases of which he was thinking, I have the
same preference as Keynes, and welcome his statement
that in some cases no probability exists. In
other cases which go back historically so far as Bayes,
we undoubtedly have, or can logically conceive of
having, partial knowledge of a probability, so that
probability statements can be made about its value,
and it would be stretching language unprofitably to say
that in such cases the probability only partially exists.

Keynes' difficulty arises, I think, from a desire to 
use the word probability primarily to refer to the 
truth of propositions, and only to the occurrence of 
events in the sense that the probability of an event 
can be identified with the probability of the truth of
the proposition that the event shall occur. But
events, such as the disintegration within a determined
time of an atom of a radio-active element may
be reasonably thought of as having an objective
probability, independent of the state of our evidence,
and this may be unknown, or known with a limited
and determinate precision. The kind of evidence of
which Keynes is thinking is the kind which, as it is
increased indefinitely, would lead the probability
inferred to tend to 0 or 1, that is to a statement
without uncertainty. He does not notice that other 
kinds of evidence may lead to the estimation of an 
objective probability with increasingly high pre- 
cision, so as to tend in the limit to an exact know- 
ledge of its value; or, that those other kinds of 
evidence which can lead to no statement of prob- 
ability whatever, may be of direct inferential value. 


2. More general hypotheses 


Scientific hypotheses usually differ from the simple 
hypothesis (the random distribution of the stars) 
considered by Michell, in that they allow of one or 
more parameters, or adjustable “constants”, any 
value of which, or, any value within assigned limits, 
is consistent with the hypothesis. With respect to 
such hypotheses tests of significance may be applied 
in two ways. In the first place, a test of significance 
may be developed capable of rejecting the hypothesis 
as a whole, if any relevant feature of the observational
record can be shown to depend on a contingency
which is sufficiently rare whatever may be the
value of the parameter. Secondly, if no such feature 
obtrudes itself, or, if any such as can be found 
is far-fetched and artificial, so that the general
hypothesis is judged provisionally acceptable, the
[...] mathematical probability to [...]
that the parameter lies between [...]

In choosing the grounds upon which a general
hypothesis should be rejected, the experimenter will 
rightly consider all points on which, in the light of 
current knowledge, the hypothesis may be imper- 
fectly accurate, and will select tests, so far as possible, 
sensitive to these possible faults, rather than to
others. Sometimes also he may make a comprehensive
test, such as Pearson's test of Goodness of Fit,
applied to observed frequencies, which though 
strictly speaking approximate, and inapplicable to 
small frequencies, is, albeit not specifically sensitive 
to particular faults, yet, when the frequencies are not 
too small, capable of detecting sufficiently pro- 
nounced discrepancies of any kind. 

In experimental breeding the data available for 
theoretical discussion are the frequencies observable 
when a progeny, or group of progenies, is classified
into a number of classes (phenotypes) the members of
which are distinguishable by inspection, or by other
tests. Thus in an intercross between organisms each 
heterozygous in respect of two genetic factors 
showing complete dominance, the expectations out
of n observed will be

$$\frac{n}{16}\,(9,\ 3,\ 3,\ 1) \tag{18}$$

if the factors are unlinked, but in general

$$\frac{n}{4}\,(2+\theta,\ 1-\theta,\ 1-\theta,\ \theta), \tag{19}$$

where θ is a parameter, depending on the closeness of
linkage, any value of which between 0 and 1 would
be eligible.

In the χ² method, if a₁, a₂, a₃, a₄ are
frequencies observed corresponding with frequencies
m₁ to m₄ expected, the measure of discrepancy
is

$$\chi^2 = \sum_{i=1}^{4} \frac{(a_i - m_i)^2}{m_i}, \tag{20}$$

and the probability with which the value of χ²
observed will be exceeded by chance for three degrees
of freedom can be found from the well-known tables.
When the expectation is calculated for factors
inherited independently, the χ² value is divisible into
three positive and independent parts,

$$\frac{1}{3n}\,(a_1 + a_2 - 3a_3 - 3a_4)^2,$$

$$\frac{1}{3n}\,(a_1 - 3a_2 + a_3 - 3a_4)^2, \tag{21}$$

and

$$\frac{1}{9n}\,(a_1 - 3a_2 - 3a_3 + 9a_4)^2,$$

each of which is distributed, on the hypothesis
tested, as χ² for one degree of freedom, that is, as the
square of a normal deviate having unit variance.
The last of these three parts is specifically sensitive
to a disturbance of the expectations of the kind due
to linkage; the first two are sensitive to disturbances
of the single-factor ratios, and if the only disturbance 
really present is a small effect of linkage, the test 
using three degrees of freedom including them also 
will be less sensitive to the linkage effect than that 
based on only one. 
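The partition of χ² into three single-degree-of-freedom parts can be checked numerically. The following is a minimal sketch, not the author's own computation: the function names are chosen here for illustration, and the four counts are hypothetical values invented for the example.

```python
def chi_square_9331(a):
    """Pearson's chi-square against 9:3:3:1 expectations (3 degrees of freedom)."""
    n = sum(a)
    expected = [n * r / 16 for r in (9, 3, 3, 1)]
    return sum((obs - exp) ** 2 / exp for obs, exp in zip(a, expected))


def chi_square_components(a):
    """The three single-degree-of-freedom parts given in equation (21)."""
    a1, a2, a3, a4 = a
    n = sum(a)
    first_ratio = (a1 + a2 - 3 * a3 - 3 * a4) ** 2 / (3 * n)
    second_ratio = (a1 - 3 * a2 + a3 - 3 * a4) ** 2 / (3 * n)
    linkage = (a1 - 3 * a2 - 3 * a3 + 9 * a4) ** 2 / (9 * n)
    return first_ratio, second_ratio, linkage


# Hypothetical counts for the four phenotypes, invented for illustration.
counts = [95, 30, 28, 7]
parts = chi_square_components(counts)
total = chi_square_9331(counts)

print("components:", parts)
print("sum of components:", sum(parts))
print("Pearson chi-square:", total)
```

The sum of the three components agrees with the three-degree-of-freedom total, which is the sense in which the value is "divisible" into independent parts.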

To test the general hypothesis the usual procedure
for large samples is to estimate the value of θ, by an
efficient estimate, to find the corresponding expectations,
and to ascertain the level of significance of the
χ² value obtained, using the Table at two degrees of
freedom. A low probability indicates that the
general hypothesis is to be rejected at the level of
significance found; in other words, there is no value
of θ for which the hypothesis is acceptable. We do
not simply use the sum of the components of χ² for
the two single-factor ratios in this test, for if linkage
were present these two components would no longer
be independent, and to combine the evidence of the
two requires an estimate of the parameter. This
problem has been discussed in detail in Statistical
Methods, section 55.
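The procedure just described (estimate θ efficiently, fit the expectations, refer χ² to two degrees of freedom) can be sketched as follows. This is an illustration under stated assumptions rather than the treatment referred to in the text: maximum likelihood is used as the efficient estimate, the function names are hypothetical, and the counts are illustrative only. For two degrees of freedom the chance of exceeding a value x is exactly exp(-x/2), so no tables are needed.

```python
import math


def estimate_theta(a):
    """Maximum-likelihood estimate of theta for expectations (n/4)(2+t, 1-t, 1-t, t).

    Setting the derivative of the log-likelihood to zero gives the quadratic
    n*t**2 - (a1 - 2*a2 - 2*a3 - a4)*t - 2*a4 = 0; its positive root is returned.
    """
    a1, a2, a3, a4 = a
    n = sum(a)
    b = a1 - 2 * a2 - 2 * a3 - a4
    return (b + math.sqrt(b * b + 8 * n * a4)) / (2 * n)


def fitted_chi_square(a):
    """Chi-square of the observations against expectations fitted with the estimate."""
    n = sum(a)
    t = estimate_theta(a)
    expected = [n * p / 4 for p in (2 + t, 1 - t, 1 - t, t)]
    return sum((obs - exp) ** 2 / exp for obs, exp in zip(a, expected))


def chi_square_sf_2df(x):
    """Chance of exceeding x for chi-square with 2 degrees of freedom (exact)."""
    return math.exp(-x / 2)


counts = [1997, 906, 904, 32]  # illustrative counts only
theta_hat = estimate_theta(counts)
x2 = fitted_chi_square(counts)

print("theta estimate:", theta_hat)
print("chi-square (2 d.f.):", x2)
print("chance of exceeding it:", chi_square_sf_2df(x2))
```

A small value on the last line would indicate that no value of θ makes the general hypothesis acceptable; a moderate value leaves it provisionally acceptable.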

Although in testing the acceptability of a hypo- 
thesis in general, no reference to the theory of 
estimation is required, at least in this important class 
of cases, beyond the stipulation that expectations can 
be fitted efficiently to the data, yet when the general 
hypothesis is found to be acceptable, and accepting 
it as true, we proceed to the next step of discussing 
the bearing of the observational record upon the 
problem of discriminating among the various possible 
values of the parameter, we are discussing the theory
of estimation itself. In this theory a case of peculiar
simplicity arises when an estimate exists which,
perhaps in conjunction with ancillary statistics,
subsumes the whole of the information, relevant to
the parameter, supplied by the observational record.
Such estimates are termed exhaustive, and their
special property may be expressed in various ways:
(i) that, given the exhaustive statistic, every other
estimate possible has a sampling distribution completely
independent of the parameter to be estimated,
and (ii) that the Likelihood Function of the parameter
inferred from the sampling distribution of the
exhaustive estimate and its ancillaries is exactly the
same as that inferred from the original observations.
In fact for all purposes of inference an exhaustive 
statistic, in association perhaps with certain ancillary 
values, which themselves are independent of the 
parametric value, can replace the entire observational 
record from which it was calculated. 

Exhaustive estimates do not always exist; it is 
therefore important to note that their existence is the 
first requirement of the mode of inductive reasoning 
to be developed below as the Fiducial Argument. A 
second condition to be noted is that the observations
should not be discontinuous, as are frequencies,
but should be measurements sufficiently accurate to
be regarded without significant error as observed
values of continuous variates, so that the statistics 
calculable from them shall have continuous dis- 
tributions. In making these distinctions, I do not 
wish to deny either that measurements, however 
accurate, are in the strictly mathematical sense 
discontinuous, or that counts may be of numbers so 
large that they could without sensible error be 
treated as measurements of continuous variables. It
is only necessary to point out that cases commonly
occur to which this distinction is relevant. In the
same way, it may be said that it can always be
imagined that statistical samples are made so large
that the distinction between exhaustive and other
efficient estimates shall become unimportant; all
that is needed is to recognize that samples are not
always so large as this, and that in such cases the
logical consequences which flow from this distinction
are not irrelevant.


3. The fiducial argument *


The term fiducial has been introduced to distinguish
a particular form of inductive reasoning
from that of Bayes, which for contrast may be
termed the Bayesian argument. The distinction was
needed because, like the method of Bayes, it leads to
probability statements applicable in the light of the
observations to an unknown parameter. Whereas,
however, the argument of Bayes requires a distribution
a priori involving probability statements of the same
logical form as those finally obtained a posteriori, the
application of the fiducial argument can only be
made in the absence of such information a priori.
In the Bayesian argument the observations are used
to convert a random variable having a well-defined
distribution a priori to a random variable having an
equally well-defined distribution a posteriori; and it
is well known that, if the observations are increased
in number their importance grows relatively to that
of the information supplied a priori, so that the 
latter becomes less and less influential upon the 
conclusions. By contrast, the fiducial argument uses 
the observations only to change the logical status of 
the parameter from one in which nothing is known 
of it, and no probability statement about it can 
be made, to the status of a random variable having 
a well-defined distribution. 

* Probability statements derived by arguments of the fiducial
type have often been called statements of "fiducial probability".
This usage is a convenient one, so long as it is recognized that
the concept of probability involved is entirely identical with
the classical probability of the early writers, such as Bayes. It is
only the mode of derivation which was unknown to them.

If direct and exact observations could be made
on the parameter itself, a similar change of logical
status would be effected by the observation of its 
value, from one in which it was wholly unknown, or 
had perhaps a known frequency distribution, to one 
in which it could be assigned a definite value. It is, 
therefore, perhaps not surprising that similar exact 
observations, though not on the parameter itself yet
on variates having distributions known in terms of 
the parameter, should be able in favourable cases to 
effect, at a lower level, a similar change of status. 

As an example of the mode of reasoning, consider a 
radio-active source emitting particles with unknown 
frequency at instants completely independent of each 
other. The interval of time between two successive 
emissions will then be distributed at random in the 
exponential distribution 


$$df = \theta e^{-\theta x}\,dx, \tag{22}$$

in which θ is the unknown average number of
emissions in each time unit. We may conceive such
time intervals to be accurately measurable, and that
a record of n of them has shown intervals

$$x_1, x_2, \ldots, x_n.$$

We suppose that these conform sufficiently with
expectations based on the estimate T, where

$$T = \frac{n}{X}, \tag{23}$$

and X stands for the sum of the times observed, for it
to be agreed that the general hypothesis is appropriate,
and that what remains is only to make such probability
statements about the value of θ as the data
allow.

From the original data, the n observed time
intervals being independent, the Mathematical Likelihood
of any value θ which the parameter may take is
seen to be proportional to

$$\theta^{n} e^{-\theta X},$$

which is maximized at the value

$$\theta = \frac{n}{X},$$

so that the estimate T chosen above is an estimate of
maximum likelihood. It is also a Sufficient Estimate,
that is to say an exhaustive estimate without
ancillary statistics, for the sampling distribution of
X is seen easily to be

$$df = \frac{\theta^{n}}{(n-1)!}\, X^{n-1} e^{-\theta X}\, dX, \tag{24}$$

giving exactly the same likelihood function for θ as
was given by the original data. The distribution of
X also is continuous over all positive values, for all
values of θ.

This distribution of X for any given θ is, in fact,
equivalent to the distribution of the quantity χ² for
2n degrees of freedom, if χ² is equated to

$$\chi^2 = 2\theta X = \frac{2n\theta}{T}. \tag{25}$$


In this case the χ² distribution is exact, and not only
approximate as is Pearson's measure of discrepancy
for frequencies; consequently if we choose any
probability P, and write

$$\chi^2_{2n}(P)$$

for that value which is exceeded for 2n degrees of
freedom with frequency P, a value which is calculable
with exactitude for all n and P, it appears that the
statement

$$\theta > \frac{T}{2n}\,\chi^2_{2n}(P) \tag{26}$$


is verified with the frequency P, for all values of P 
chosen, and therefore that we have derived formally 
a frequency distribution of the unknown parameter θ
appropriate to the observations available. 
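The frequency property claimed for statement (26) can be checked by simulation. The sketch below is an illustration only and settles nothing about the fiducial interpretation itself: it draws repeated samples of n exponential intervals for a fixed θ and counts how often the inequality holds. The function names are hypothetical. The inequality is tested through the equivalent condition that the upper-tail chance of χ² for 2n degrees of freedom at the value 2θX is below P, and for even degrees of freedom that chance is a finite sum.

```python
import math
import random


def chi_square_sf_even(x, n):
    """Chance that chi-square with 2n degrees of freedom exceeds x.

    For an even number of degrees of freedom this is the finite sum
    exp(-x/2) * sum over k < n of (x/2)**k / k!.
    """
    y = x / 2
    term = math.exp(-y)
    total = term
    for k in range(1, n):
        term *= y / k
        total += term
    return total


def statement_26_holds(theta, intervals, P):
    """True when theta > (T / 2n) * chi2_{2n}(P), with T = n / X.

    Equivalent to 2*theta*X exceeding the P-point, that is, to the
    upper-tail chance at 2*theta*X being below P.
    """
    n = len(intervals)
    X = sum(intervals)
    return chi_square_sf_even(2 * theta * X, n) < P


def observed_frequency(theta, n, P, trials, seed=0):
    """Proportion of simulated samples for which statement (26) is verified."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(theta) for _ in range(n)]
        hits += statement_26_holds(theta, sample, P)
    return hits / trials


for theta in (0.5, 2.0, 10.0):
    print(theta, observed_frequency(theta, n=5, P=0.25, trials=20000))
```

Whatever value of θ is used to generate the samples, the observed proportion should lie close to the chosen P, which is the sense in which the statement is "verified with the frequency P".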

The applicability of the probability distribution to 
the particular unknown value of θ sought by an
experimenter, without knowledge a priori, on the
basis of the particular value of T given by his
experiment, has been disputed, and certainly deserves
to be examined, especially as in the first case in which
I exhibited this form of argument, namely to the
correlation coefficient (1930), though the example was
appropriate, my explanation left a good deal to be
desired. 

The reasoning developed so far has been entirely 
deductive; the example was chosen, however, to 
bring out some necessary characteristics of inductive 
reasoning. The probability statement first developed 
above (24) had as reference set all the values of T
which might have occurred in unselected samples for
a particular value of θ. It has, however, been proved
for all values of θ, and so is applicable to the enlarged
reference set of all pairs of values (T, θ)
from all values of θ. The particular pair of values of
θ and T appropriate to a particular experimenter
certainly belongs to this enlarged set, and within this
set the proportion of cases satisfying the inequality
(26),

$$\theta > \frac{T}{2n}\,\chi^2_{2n}(P),$$

is certainly equal to the chosen probability P. It
might, however, have been true, as in the case of a 
gambler throwing a single die, discussed on page 32, 
that in some recognizable sub-set, to which his case 
belongs, the proportion of cases in which the in- 
equality was satisfied should have some value other 
than P. It is the stipulated absence of knowledge 
a priori of the distribution of θ, together with the
exhaustive character of the statistic T, that makes
the recognition of any such subset impossible, and
so guarantees that in his particular case, as in the
case of a single particular throw contemplated by the 
gambler, the general probability is applicable. 

Had knowledge a priori been available, the argument
developed above would have been precluded by
the consideration that some of the relevant data had
been omitted. For, although in the deduction of
statements of certainty it is legitimate to draw
inferences from some of the axioms available while
ignoring others, or, in other words to base a valid
argument on a chosen subset only of the available 
axioms, no such liberty can be taken with statements 
of uncertainty, where it is essential to take the whole 
of the data into account, though some part of it may 
be shown on examination to be irrelevant, and not to
affect the result.

Again, had there been knowledge a priori, the
argument of Bayes could have been developed, which
would have utilized all the data, and which would
in general have led to a distribution a posteriori
different from that to which the fiducial argument
leads. Bayes' method in fact calculates the distribution
of θ in a particular subset of pairs of
values (T, θ), defined by T, and to which therefore the
observation belongs. Consequently, it is essential to
introduce the absence of knowledge a priori as a dis- 
tinctive datum in order to demonstrate completely 
the applicability of the fiducial method of reasoning 
to the particular real and experimental cases for 
which it was developed. This last point I failed to 
perceive when, in 1930, I first put forward the fiducial 
argument for calculating probabilities. For a time 
this led me to think that there was a difference in 
logical content between probability statements 
derived by different methods of reasoning. There are 
in reality no grounds for any such distinction. 
Various writers, including Sir Harold Jeffreys* and 
A. Kolmogorov,5 recognizing the rational cogency of 
the fiducial form of argument, and the difficulty of 
rendering it coherent with the customary forms of 
statement used in mathematical probability, have 
proposed the introduction of new axioms to bridge 
what was felt to be a gap. The treatment in this 
book involves no new axiom; it does, however, rely 
on a property inherent in the semantics of the word 
“probability”, though not required explicitly so long 
as the applicability in the real world of the logical 
relationship denoted is not in question. Purely
abstract studies of the formal mathematics of
[...] be developed without [...] word's meaning. It is
[...] mathematical definition [...]
nature and extent, not of the [...]
of the ignorance, required [...]
envisaged, and that so long [...]
purely deductive reasoning is proper, that valid
deductions can be drawn from every subset of the 
axiomatic material available, it can be argued, as by 
Venn, that “his ignorance affects himself only, and 
corresponds to no distinction in the things”. Mathe- 
matical probability, however, as conceived by the 
early writers, was applicable to the real world, and to 
make it available not only in deductive, but also in 
inductive reasoning a more complete definition is 
required. The subject of a statement of probability 
must not only belong to a measurable set, of which a 
known fraction fulfils a certain condition, but every 
subset to which it belongs, and which is characterized 
by a different fraction, must be unrecognizable. 


4. Accurate statements of precision

The possibility of making exact statements of
probability about unknown constants of Nature
supplies a need long felt of making a complete
specification of the precision with which such constants
are estimated. For example, if, in the case
considered in the foregoing section, there had been 
500 accurately measured time intervals, calculations 
based on the distribution of χ² for 1000 degrees of
freedom would show that the probability was 25%
of the true value lying below 0.96957 of the estimate,
and 25% of lying above 1.02988 times the same
quantity. These values then bracket a central
region of about 6% within which the true value will
lie with a probability of just one half. They therefore
fulfil the same function as the traditional
"probable error" often given in respect of astronomical
observations. The concept of the probable
error indicates the desire felt for such probability
statements, although the great complexity of the
observational material to be reduced in many 
astronomical calculations has stood in the way of the 
refinement of the concept, and has indeed introduced 
great difficulties in the way of obtaining a reliable 
figure. When, as in the example chosen, the data are 
simple and the meaning of the calculations com- 
pletely clear, other relevant probability statements 
may be made with equal confidence and exactitude. 
For example, the probability is 5% each way of the
true value lying outside the limiting ratios 0.92731
and 1.07439, and it is only 1% of it lying below
0.89819 and another 1% of lying above 1.10622, so
that the odds are 49 to 1 that it should lie within these
last limits. The fiducial distribution in this way
comprises a complete set of probability statements
appropriate to any chosen level of probability, or to
any chosen limits. In such cases the precision of the
estimate is completely specified.

With a smaller number of time-measurements,
such as 15, the precision of the estimate would
have been lower, but the fiducial limits and corresponding
levels of probability would still have
been exact. Compendiously [...]
may be stated [...]

Excluding at each end    Lower limit    Upper limit
         1%                0.4984         1.6964
         5%                0.6164         1.4591
        25%                0.8159         1.1600

There is here a probability of over 1% that the
value should be less than half the estimate [...]
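The limits tabulated for 15 time-measurements follow from statement (26) with 2n = 30 degrees of freedom, and can be recomputed without tables. The sketch below is an illustration under stated assumptions, not the author's method of calculation: the function names are hypothetical, the upper-tail chance uses the finite sum valid for even degrees of freedom, and the percentage point is located by simple bisection.

```python
import math


def chi_square_sf_even(x, n):
    """Chance that chi-square with 2n degrees of freedom exceeds x."""
    y = x / 2
    term = math.exp(-y)
    total = term
    for k in range(1, n):
        term *= y / k
        total += term
    return total


def chi_square_point(P, n):
    """Value exceeded with frequency P by chi-square with 2n degrees of freedom."""
    lo, hi = 0.0, 4.0 * n + 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        # The tail chance falls as x rises, so move right while it exceeds P.
        if chi_square_sf_even(mid, n) > P:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def fiducial_ratio(P, n):
    """Ratio theta/T that is exceeded with probability P, from statement (26)."""
    return chi_square_point(P, n) / (2 * n)


print("excluded  lower   upper")
for tail in (0.01, 0.05, 0.25):
    lower = fiducial_ratio(1 - tail, 15)
    upper = fiducial_ratio(tail, 15)
    print(f"{tail:>8.0%}  {lower:.4f}  {upper:.4f}")
```

Each printed row gives, for the tail excluded at each end, the lower and upper limits as multiples of the estimate T.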
The objection has been raised that since any statement
of probability to be objective must be verifiable
as a prediction of frequency, the calculations set out
above cannot lead to a true probability statement
based on a particular value of T, for the data do
not provide the means of calculating this. This seems
to assume that no valid probability statement can be
made except by the use of Bayes' theorem. However,
the aggregate of cases of which the particular
experimental case is one, for which the relative
frequency of satisfying the inequality statement is
known to be P, and to which all values of T are
admissible, could certainly be sampled indefinitely to
demonstrate the correct frequency. In the absence
of a prior distribution of population values there is
no meaning to be attached to the demand for
calculating the results of random sampling among
populations, and it is just this absence which completes
the demonstration that samples giving a
particular value T, arising from a particular but
unknown value of θ, do not constitute a distinguishable
sub-aggregate to which a different probability
should be assigned. Probabilities obtained by a
fiducial argument are objectively verifiable in exactly 
the same sense as are the probabilities assigned in 


games of chance. 
It has, as was shown in the previous chapter, been
a hope or ambition among many mathematicians of
the last two hundred years that the concept of
Mathematical Probability should be found to be applicable,
not only to idealized games of chance, but to practical
affairs, and in particular to inferences in the Natural
Sciences. The fiducial argument demonstrates at
least one meaningful application beyond that for
which it was originally defined, and without needing
the knowledge a priori required for Bayes' method of
reasoning.

It may be noticed in the example chosen above, 
and in other cases in which a probability distribution 
can be calculated by the fiducial argument, that the 
region containing, for example, the lowest 1% 
of the frequency distribution is exactly that com- 
prising values of the parameter which would have 
been rejected as too low by a valid test of significance 


for the unknown parameter. The direct step from 
the test of significance to a probability distribution 


5. Discontinuous Observations 
The data discussed by Bayes, in which a successes
have been observed out of (a+b) trials, are discontinuous,
and are not examples suitable for
exhibiting the fiducial argument. As has been

example, if both a and b are counts of many millions
or hundreds of millions of independent instances, the
probability of success might, if only ordinary levels
of precision were in view, be recognized to have
effectively the status of a directly observable quantity
namely,
$$p = a/(a+b), \qquad (27)$$
in which we may ignore the logical fact that p is 
equated to an estimate affected by errors of random 
sampling. At least we may ignore this fact after 
having satisfied ourselves, by reference to a more 
exact treatment, that the precision is such that for 
the purposes for which we need it, the observed value 
is sufficiently accurate, with sufficiently high prob- 
ability. 

If the frequencies are of intermediate size, of the
order, let us say, of 1000 or 10,000, we probably
should not be willing to ignore the sampling error, but
might be content, as in the case of the χ² test of
Goodness of Fit, to ignore the discontinuity of the
observations. Comparing the expectations for any
assigned value of p, with the observations, we should
have

Expected    Observed
p(a+b)      a
q(a+b)      b

leading to
$$\chi^2 = \frac{(qa-pb)^2}{pq(a+b)} \qquad (28)$$
as the test of significance for one degree of freedom.
$$\chi = \frac{qa-pb}{\sqrt{pq(a+b)}}$$
is then normally distributed with unit variance.
Choosing any appropriate significance level, such as
2%, the value of χ² is known, in this case to be 5.412,
and substituting this value, we have a quadratic
equation, giving the two values of p for which the
deviation from the observations is exactly at the 2%
level of significance. The probability that p should
be less than the lower root of the equation is 1%, as
is that of it exceeding the higher root. Similar
values, excluding 5% at each end, could be obtained
by taking 2.706 for χ². When the discontinuity of
the observations can be ignored, the fiducial argument
justifies statements of probability, concerning
the unknown value p, which is thus estimated as a
random variable with a precision defined by a
consistent aggregate of such probability statements.
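The quadratic mentioned here can be solved directly. The following is a minimal sketch, assuming the test criterion (qa − pb)² = χ² pq(a+b) with q = 1 − p; the counts 300 and 700 are hypothetical values chosen only for illustration, and the function name is not from the text.

```python
import math


def chi_square_limits(a, b, chi2):
    """Two roots in p of (q*a - p*b)**2 == chi2 * p * q * (a + b), with q = 1 - p."""
    n = a + b
    # Since q*a - p*b = a - p*n, the criterion rearranges to
    #   (n + chi2) p^2 - (2a + chi2) p + a^2 / n = 0.
    coef2 = n + chi2
    coef1 = -(2 * a + chi2)
    coef0 = a * a / n
    root = math.sqrt(coef1 * coef1 - 4 * coef2 * coef0)
    return (-coef1 - root) / (2 * coef2), (-coef1 + root) / (2 * coef2)


# Hypothetical counts; 5.412 is the 2% point of chi-square for one degree of freedom.
lower, upper = chi_square_limits(300, 700, 5.412)
print(f"1% point below: {lower:.4f}, 1% point above: {upper:.4f}")
```

Taking 2.706 in place of 5.412 gives the narrower pair of limits excluding 5% at each end.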

Since the fiducial argument in full strictness cannot 
be applied, owing to the actual discontinuity of the 
data, it would be improper to regard this distribution, 
though precise in form, as more than an asymptotic 
solution of Bayes' problem, when knowledge a priori 
is absent. As such, it is, however, relevant that the 
mean value of p for given observational frequencies 
a, b may be expressed as 


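$$\bar{p} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{2a+\chi^2}{2(N+\chi^2)}\, e^{-\frac{1}{2}\chi^2}\, d\chi \qquad (28.1)$$
or
$$\frac{a}{N} + \frac{b-a}{2N^2} - \frac{3(b-a)}{2N^3} + \dots \qquad (28.2)$$

Now the mean of Bayes' distribution a posteriori
admits of the expansion

$$\frac{a+1}{N+2} = \frac{a}{N} + \frac{b-a}{N^2} - \frac{2(b-a)}{N^3} + \dots, \qquad (28.3)$$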
whereas, had Bayes' calculation been carried out with
the probability element a priori given by (6),
the mean would have been
$$\frac{a+\frac{1}{2}}{N+1} = \frac{a}{N} + \frac{b-a}{2N^2} - \frac{b-a}{2N^3} + \dots \qquad (28.4)$$
For asymptotic agreement with the fiducial distribution,
so far as the term in N⁻², Bayes' postulated
distribution a priori should have been that given
in expression (6), derived from the angular
transformation. To this extent a particular given
distribution a priori may be nearly equivalent to
complete ignorance a priori.
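The two Bayesian means compared here can be checked with exact arithmetic. The sketch assumes that expression (6) is the element proportional to 1/√(pq), and that Bayes' own postulate is the uniform element; `posterior_mean` and the two series helpers are illustrative names, and the counts are hypothetical.

```python
from fractions import Fraction


def posterior_mean(a, b, prior_shape):
    """Mean a posteriori of p when the element a priori is proportional to
    (p*q)**(prior_shape - 1): prior_shape = 1 is uniform, 1/2 is 1/sqrt(pq)."""
    return (Fraction(a) + prior_shape) / (a + b + 2 * prior_shape)


def series_uniform(a, b):
    """First three terms of the expansion of (a + 1)/(N + 2) in powers of 1/N."""
    n = Fraction(a + b)
    return a / n + (b - a) / n**2 - 2 * (b - a) / n**3


def series_angular(a, b):
    """First three terms of the expansion of (a + 1/2)/(N + 1) in powers of 1/N."""
    n = Fraction(a + b)
    return a / n + (b - a) / (2 * n**2) - (b - a) / (2 * n**3)


a, b = 300, 700  # hypothetical counts
print(float(posterior_mean(a, b, 1) - series_uniform(a, b)))
print(float(posterior_mean(a, b, Fraction(1, 2)) - series_angular(a, b)))
```

Both printed differences are of the order of 1/N⁴, which is what a three-term expansion should leave.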
Equation (28.1) is not, indeed, exact, since in the
third term of the expansion (28.2) the effects of the
third and fourth cumulants of the binomial become
appreciable. If allowance be made for these by the
method of Cornish and Fisher⁶ (1937), an expansion
may be obtained of the parameter p in terms of a
normal deviate x, of which the first few terms are

$$p = \frac{a}{N} + \frac{\sqrt{ab}}{N^{3/2}}\,x + \frac{b-a}{6N^2}\,(2x^2+1) + \dots$$


From this expansion, it can be ascertained that the
mean of the fiducial distribution is

$$\frac{a}{N} + \frac{b-a}{2N^2} - \frac{b-a}{2N^3} + \dots,$$

agreeing so far as the third term with (28.4), the
mean of the Bayesian distribution a posteriori, using
the element a priori

$$\frac{1}{\pi\sqrt{pq}}\,dp.$$

I do not know for how many terms this agreement
continues. It should, however, be noted that the
variances of the two distributions are not the same.
That for the fiducial distribution is the larger by
1/12N².

For smaller numbers of independent observations 
than those for which the effects of discontinuity are 
negligible, a lower logical status is recognizable in 
which neither effectively exact definitive statements, 
nor statements in terms of Mathematical Probability, 
are possible, yet in which some information is 
available, and we are not in a state of complete 
ignorance. Evidently in such cases tests of significance
are available. Thus, to take an example
employed in illustrating the Table of z, in the
Introduction to Statistical Tables,² if 3 successes have been
observed out of 14 trials, then probabilities of success
exceeding .557 may be excluded by the consideration
that for such values the total probability of observing
0, 1, 2 or 3 successes is only 1%, while values below
.0331 may be excluded on the ground that the total
probability of values from 3 to 14 would similarly
fall below 1%. In this way tests of significance are
available capable of finding at any level of significance 
limits outside which all values of the parameter are 
to be rejected. These have been called “Confidence 
Limits", and though they fall short in logical content 
of the limits found by the fiducial argument, and with
which they have often been confused, they do fulfil
some of the desiderata of statistical inferences. 
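The two limits quoted for 3 successes in 14 trials can be reproduced by summing binomial terms directly rather than consulting a table. This is a minimal sketch using bisection; the function names are illustrative and not from the text.

```python
from math import comb


def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def confidence_limits(a, n, tail):
    """Values of p at which 'a or more' and 'a or fewer' each have probability `tail`."""

    def bisect(g):
        # g is increasing in p, negative at p = 0 and positive at p = 1
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = bisect(lambda p: (1 - binom_cdf(a - 1, n, p)) - tail)
    upper = bisect(lambda p: tail - binom_cdf(a, n, p))
    return lower, upper


lower, upper = confidence_limits(3, 14, 0.01)
print(f"lower {lower:.4f}, upper {upper:.4f}")
```

Other levels of significance are obtained by changing the third argument.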

(i) With the aid of the Table of the z-distribution,
or of other tables serving the same purpose of
determining the partial sum of the terms of the binomial
expansion,
$$(q+p)^{n},$$
the Confidence Limits can be sufficiently easily
calculated.

(ii) They serve to divide the range of possible
values of the unknown into a series of zones of more
or less acceptable values.

(iii) The results of the calculations are readily
communicable, and the method is sufficiently known
to be widely understood.

Nevertheless, from the point of view of making the 
most of limited data, and of drawing from them con- 
clusions as strong as they can properly be made, the 
system of Confidence Limits seems to provide less 
than even this comparatively uninformative type of 
data would support.

It has been frequently stated, as though it were a 
characteristic property of Confidence limits, that the 
interval between them will in repeated samples cover 
the true value with the exact frequency correspond- 
ing with the level of significance chosen. E.g. that in 
98% of trials the true value would be found to lie 
between the two 1% points. This, if true, would 
give them the force of a statement of probability. 
However, actually, the true value will lie between the 
assigned limits generally in more than 98% of such 
trials, and no exact statement of probability can be 
inferred. Exactly verifiable probability statements 
are not a characteristic of Confidence limits, as they

Likelihood is not, of course, to be confused with
Mathematical Probability. It is, like Mathematical 
Probability, a well-defined quantitative feature of the 
logical situations in which it occurs, and like Mathe- 
matical Probability can serve in a well-defined sense 
as a “measure of rational belief"; but it is a quantity 
of a different kind from probability, and does not 
obey the laws of probability. Whereas such a phrase 
as “the probability of A or B” has a simple meaning, 
where A and B are mutually exclusive possibilities, 
the phrase “the likelihood of A or B” is more parallel 
with "the income of Peter or Paul"—you cannot 
know what it is until you know which is meant. 

In relation to the logical situations so far discussed, 
Mathematical Likelihood has already appeared as the 
factor, appropriate to each possible parametric value, 
by which each element of probability a priori is con- 
verted in Bayes’ method to the corresponding element 
of probability a posteriori. It represents that part of 
Bayes’ calculation provided by the data themselves. 
With regard to simple tests of significance, if such 
tests were performed on the same data against 
several mutually exclusive hypotheses, since the 
likelihood of any hypothesis is proportional to the 
probability, accepting that hypothesis as true, of 
such observations occurring as have been made, the 
greatest reluctance will be felt to accepting the least 
likely hypothesis, and the least reluctance to the most 
likely. Such comparisons can be made even though 
only relative values of the likelihood function are 
meaningful. The likelihood supplies a natural order
of preference among the possibilities under con- 
sideration. It is not surprising, therefore, though 
independently demonstrable, that in the Theory of 
Estimation, all rational criteria of what is to be
desired in an estimate converge on the particular 
value for which the likelihood is maximized. The 
Method of Maximum Likelihood is indeed much used 
and widely appreciated in the statistical literature, 
without, I fancy, so much appreciation of the signifi- 
cance of the system of likelihood values at other 
possible values of the parameter. In the theory of
estimation³ it has appeared that the whole of the
information supplied by a sample is comprised in the
likelihood, as a function known for all possible values
of the parameter.

The relation between probability and likelihood in
the case in which probabilities are accessible by the
fiducial argument, is intimate. If θ stand for an
unknown parameter and T for a Sufficient, or
Exhaustive, estimate of it, and if for all values of θ
and T, the function,

$$P = F(T, \theta),$$

stand for the probability that a sample drawn from a
population with parameter θ, shall yield a statistic
less than T, both θ and T being continuous functions
over the same range, and F monotonic for both
variables, then the distribution of T for given θ has
the frequency element,

$$\frac{\partial F}{\partial T}\,dT,$$

so that the likelihood of any value θ is determined by
the relation

$$L(\theta) \propto \frac{\partial F}{\partial T},$$

for the particular value of T observed, and for all
are of the limits that can be assigned when the
fiducial argument is available.

As the probability defining the level of significance
is raised the width of the central zone still regarded as
acceptable is narrowed, but it is not closed until the
probability level is raised to considerably over 50%.
At the value of p suggested by the data, namely
3/14 (21.42857%), the probability of observing 3 or
less is 64.832% and that of observing 3 or more is
60.402%. At p = .22, the probability of 3 or less is
62.807% and that of 3 or more is 62.394%. This
system of zones therefore closes at a value of p
slightly greater than p = .22, and at this point the
level of significance is over 62%. Neither value
seems to have any sufficient inferential meaning to
be worthy of record and report. The method of
zoning by Confidence Intervals does not pick out as
of any importance the value of p for which the
Mathematical Likelihood is greatest, or, in other
words, that which would have the highest probability
of leading to the result observed.
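The percentages quoted can be recomputed from the binomial terms, and the point at which the zones close can be located by bisection. The sketch below is illustrative only; `tail_probabilities` and `closing_point` are names introduced here.

```python
from math import comb


def tail_probabilities(a, n, p):
    """Return (probability of a or fewer, probability of a or more) successes in n trials."""
    terms = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return sum(terms[: a + 1]), sum(terms[a:])


def closing_point(a, n):
    """Value of p at which the two tail probabilities are equal."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        fewer, more = tail_probabilities(a, n, mid)
        if fewer > more:  # p still too small
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for p in (3 / 14, 0.22):
    fewer, more = tail_probabilities(3, 14, p)
    print(f"p = {p:.5f}: 3 or less {fewer:.3%}, 3 or more {more:.3%}")

p_close = closing_point(3, 14)
print(f"zones close at p = {p_close:.5f}, level {tail_probabilities(3, 14, p_close)[0]:.3%}")
```

The closing value differs from 3/14, the value at which the likelihood is greatest.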


sometimes been made that the 
method of calculating Confidence 


ure is indeed not very 


9n. It should be 
pointed out that when the Probability of 3 or less is
small, most of this small probability will be due to the
case “exactly 3”, and that the contribution of the
other three cases is not very important, although it
does increase or decrease with varying p at a relative
rate different from the contribution of “exactly 3”
itself. Similarly, at values of p at which “3 or more”
has a low probability, a large part of this probability
will be due to the particular case that has been
observed, and the calculation will have been perhaps
little influenced by the frequencies of the cases,
included in the calculation, which have not in fact
been observed.

It would, however, have been better to have
compared the different possible values of p, in relation
to the frequencies with which the actual values
observed would have been produced by them, as
is done by the Mathematical Likelihood, a function
of the unknown parameter proportional to these
frequencies, or in this case to

$$p^{a}(1-p)^{b},$$

having its maximum value at

$$p = a/(a+b),$$

and therefore expressible in terms of its maximum, as

$$\left(\frac{(a+b)\,p}{a}\right)^{a}\left(\frac{(a+b)(1-p)}{b}\right)^{b}, \quad\text{or}\quad \frac{(a+b)^{a+b}}{a^{a}\,b^{b}}\;p^{a}(1-p)^{b}. \qquad (29)$$


The Mathematical Likelihood assignable to every 
value of the unknown parameter p supplies a zoning 
of the admissible range of values more directly 
appropriate to the observations than that provided 
by the system of Confidence belts. Mathematical 
values of θ. Concurrently, we have seen that the
frequency distribution of θ to be inferred from the
data has the frequency element

$$-\frac{\partial F}{\partial \theta}\,d\theta.$$

The likelihood function and the probability distribution
thus supply complementary specifications
of the same situation.

“Confidence Limits” and “Confidence Belts" were 
I think developed and advocated under the impression 
that in a wider class of cases they could provide 
information similar to that of the probability state- 
ments derived by the fiducial argument. It is clear, 
however, that no exact probability statements can 
be based upon them, and this seems now to be under- 
stood. They may be taken to supply statements of 
inequality. The tests of significance on which they 
are based are, of course, valid, but if these are used 
for zoning the possible values of the parameter, the 
zones they give do not assign exactly the same order 
of preference as that supplied by the likelihood 
function. This is due to the use of heterogeneous 
groups of possibilities of which some have been 
observed, and others have not, in making up the 
blocks for which the probabilities are calculated. 

For all purposes, and more particularly for the 
communication of the relevant evidence supplied by a 
body of data, the values of the Mathematical Likeli- 
hood are better fitted to analyse, summarize, and 
communicate statistical evidence of types too weak 
to supply true probability statements; it is important 
that the likelihood always exists, and is directly 
calculable. It is usually convenient to tabulate its 
logarithm, since for independent bodies of data such
as might be obtained by different investigators, the
“combination of observations” requires only that the
log likelihoods from different sources should be
added.

In the case under discussion a simple graph of the
values of the Mathematical Likelihood expressed as a
percentage of its maximum, against the possible
values of the parameter p, shows clearly enough what
values of the parameter have likelihoods comparable
with the maximum, and outside what limits the
likelihood falls to levels at which the corresponding values
of the parameter become implausible. In Fig. 1 the
likelihood is plotted against p. If instead of p, a
transformed value, such as the normal deviate
commonly used in Biological Assay is employed, the
curve is transformed using invariant ordinates, and
not, as would be the case with a frequency curve, with
invariant areas. This is shown in Fig. 2. The areas
under these curves are irrelevant. In each diagram
zones are indicated showing the limits within which
the likelihood exceeds 1/2, 1/5, and 1/15 of the
maximum. Values of the parameter outside the last
limits are obviously open to grave suspicion. The
actual limits found are:


Likelihood    Lower limit    Upper limit
ratio         %              %
50%           10.5889        35.9225
20%            6.6652        44.3301
6⅔%            4.2001        51.6876
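These limits follow from the likelihood expressed as a fraction of its maximum, with a = 3 and b = 11. The sketch below finds them by bisection on the logarithm of that fraction; the function names are illustrative only.

```python
import math


def log_relative_likelihood(p, a, b):
    """Natural logarithm of the likelihood of p as a fraction of its maximum."""
    p_hat = a / (a + b)
    return a * math.log(p / p_hat) + b * math.log((1 - p) / (1 - p_hat))


def likelihood_limits(a, b, ratio):
    """Values of p, below and above a/(a+b), at which the relative likelihood equals `ratio`."""
    target = math.log(ratio)
    p_hat = a / (a + b)

    def bisect(outer, inner):
        # the relative likelihood is below the target at `outer` and above it at `inner`
        for _ in range(200):
            mid = (outer + inner) / 2
            if log_relative_likelihood(mid, a, b) < target:
                outer = mid
            else:
                inner = mid
        return (outer + inner) / 2

    return bisect(1e-12, p_hat), bisect(1 - 1e-12, p_hat)


for ratio in (1 / 2, 1 / 5, 1 / 15):
    lower, upper = likelihood_limits(3, 11, ratio)
    print(f"{ratio:7.2%}  {100 * lower:8.4f}  {100 * upper:8.4f}")
```

Unlike the confidence limits, these zones are centred on the value 3/14 at which the likelihood is greatest.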
The simplicity of the data chosen for this example is
an unessential accident. Very extensive observations
may best be summarized in terms of the likelihood
function calculable from them. A schedule of log
likelihoods such as that shown below, in a form
suitable for the information of a later worker, may be
much more compact than the data from which it has
been calculated, and yet convey all that is needed from
them.

        −3 log p                    −3 log p
p%      −11 log (1−p)       p%      −11 log (1−p)
 3      4.714147            20      3.162920
 4      4.388837            25      3.180506
 5      4.148130            30      3.272558
 6      3.961139            35      3.425749
 8      3.689064            40      3.634156
10      3.503332            45      3.896373
12      3.373147            50      4.214420
14      3.282132            55      4.593574
17      3.198794            60      5.042886
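The schedule can be regenerated from the two counts, and schedules from independent sources combined by addition. In the sketch below the logarithms are taken to base 10, which reproduces the tabulated figures; the second body of data (5 successes, 9 failures) is hypothetical and serves only to illustrate the addition.

```python
import math


def neg_log10_likelihood(p, a, b):
    """Minus the base-10 log likelihood of p, for a successes and b failures."""
    return -a * math.log10(p) - b * math.log10(1 - p)


percentages = [3, 4, 5, 6, 8, 10, 12, 14, 17, 20, 25, 30, 35, 40, 45, 50, 55, 60]

# Schedule for the data of the text: 3 successes, 11 failures.
schedule = {pc: neg_log10_likelihood(pc / 100, 3, 11) for pc in percentages}

# Hypothetical second investigator: 5 successes, 9 failures.
other = {pc: neg_log10_likelihood(pc / 100, 5, 9) for pc in percentages}

# Combining independent bodies of data only requires adding the log likelihoods.
combined = {pc: schedule[pc] + other[pc] for pc in percentages}

for pc in percentages:
    print(f"{pc:3d}  {schedule[pc]:.6f}  {combined[pc]:.6f}")
```

The combined column is identical with the schedule that would be computed from the pooled counts, 8 successes and 20 failures.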


Apart from the simple test of significance, therefore,
there are to be recognized and distinguished,
between the levels of certain knowledge and of total
nescience, two well-defined levels of logical status for
parameters lying on a continuum of possible values,
namely that in which the probability is known for the
parameter to lie between any assigned values, and
that in which no probability statements being
possible, or only statements of inequality, the
Mathematical Likelihood of all possible values can be
determined from the body of observations available.

By means of appropriate observations a quantity
may conceivably pass discontinuously from one
status to another implying fuller knowledge.
Alternatively, the mere accumulation of data of the same
kind on a sufficient scale may induce a kind of
asymptotic approach to a higher status.

The implications of such a classificatory framework
may be made clear by a greater variety of
examples such as will be discussed in later chapters.


74 STATISTICAL METHODS AND SCIENTIFIC INFERENCE 


REFERENCES

1. R. A. Fisher (1930). Inverse probability.
   Proc. Camb. Phil. Soc., vol. 26, pp. 528-535.

2. R. A. Fisher and F. Yates (1938-1953). Statistical Tables.
   Oliver and Boyd, Edinburgh, 1st ed.

3. R. A. Fisher (1925). Theory of statistical estimation.
   Proc. Camb. Phil. Soc., vol. 22, pp. 700-725.

4. H. Jeffreys (1940). Note on the Behrens-Fisher formula.
   Ann. Eugen., vol. 10, pp. 48-51.

5. A. N. Kolmogorov (1942). The estimation of the mean and precision
   of a finite sample of observations. Section 5: Fisher's fiducial
   limits and fiducial probability.
   Bull. Acad. Sci. U.S.S.R., Math. Ser., vol. 6, pp. 3-32.

6. E. A. Cornish and R. A. Fisher (1937). Moments and cumulants in
   the specification of distributions.
   Rev. Inst. int. Statist., vol. 4, pp. …

7. R. A. Fisher (1925-1954). Statistical Methods for Research Workers.
   Oliver and Boyd, Edinburgh.

8. J. M. Keynes (1921). A Treatise on Probability.
   Macmillan and Co., London.


CHAPTER IV 


SOME MISAPPREHENSIONS ABOUT 
TESTS OF SIGNIFICANCE 


1. Tests of significance and acceptance decisions 


The common tests of significance, familiarly known
as Pearson's χ² test of goodness of fit (1900),
“Student” 's t-test (1908), the z (or F) test of the
analysis of variance (1924), and many others
designed on the same principles, have come in the 
first two quarters of the twentieth century to play a 
rather central part in statistical analysis. In the day- 
to-day work of experimental research in the natural 
Sciences, they are constantly in use to distinguish 
real effects of importance to a research programme 
from such apparent effects as might have appeared in 
consequence of errors of random sampling, or of 
uncontrolled variability, of any sort, in the physical 
or biological material under examination. They are 
used to recognize, among innumerable examples that 
could be given, the genuineness of a genetic linkage, 
the reality of the response to manurial treatment of a
cultivated crop, the deterioration of a food product 
in storage, or the difference between machines in the 
frequency of defective parts produced by them. The
conclusions drawn from such tests constitute the
steps by which the research worker gains a better
understanding of his experimental material, and of
the problems which it presents.

It is noteworthy, too, that the men who felt the
need for these tests, who first conceived them, or
later made them mathematically precise, were all
actively concerned with researches in the natural 
Sciences. More recently, indeed, a considerable body 
of doctrine has attempted to explain, or rather to 
reinterpret, these tests on the basis of quite a different 
model, namely as means to making decisions in an 
acceptance procedure. The differences between these 
two situations seem to the author many and wide, and 
I do not think it would have been possible to over- 
look them had the authors of this reinterpretation had 
any real familiarity with work in the natural Sciences, 
or consciousness of those features of an observational 
record which permit of an improved scientific under- 
standing, such as are particularly in view in the 
design of experiments. The misapprehensions, 
indeed, appear to go deeper than would be expected 
from a mere transference of techniques from one field 
of study to another. 

In various ways what are known as acceptance pro- 
cedures are of great importance in the modern world. 
When a large concern such as the Royal Navy receives
material from its makers, it is, I suppose, subjected
to sufficiently careful inspection and testing to
reduce the frequency of the acceptance of faulty or
defective consignments. The instructions to the
officers carrying out the tests must also, I conceive, be
such as to keep low both the cost of testing and the
frequency of the rejection of satisfactory lots. Much
ingenuity and skill must be exercised in making an
acceptance procedure a really effectual and economical
one. It is not therefore at all in disdain of an artifice
of proved value, in commerce and technology, that
I shall emphasize some of the various ways in which
this operation differs from that by which improved
theoretical knowledge is sought in experimental
research. This emphasis is primarily necessary 
because the needs and purposes of workers in the 
experimental sciences have been so badly misunder- 
stood and misrepresented. It is, of course, also to be 
suspected that those authors, such as Neyman and 
Wald, who have treated these tests with little regard 
to their purpose in the natural sciences, may not have 
been more successful in the application of their ideas 
to the needs of acceptance procedures. It is, however,
to the evident advantage of both kinds of application 
that the theories developed and taught to mathema- 
ticians should not confuse their several requirements. 

In attempting to identify a test of significance as 
used in the natural sciences with a test for acceptance, 
one of the deepest dissimilarities lies in the popula- 
tion, or reference set, available for making statements 
of probability. Confusion under this head has on
several occasions led to erroneous numerical values;
for, where acceptance procedures are appropriate, the
population of lots of one or more items, which could be
chosen for examination, is unequivocally defined.
The source of supply has an objective empirical
reality. Whereas, the only populations that can be
referred to in a test of significance have no objective
reality, being exclusively the product of the
statistician's imagination through the hypotheses which he
has decided to test, or usually indeed of some
specific aspect of these hypotheses. The demand was
first made, I believe, in connection with Behrens'
test of the significance of the difference between the
means of two populations of unknown variances,
that the level of significance should be determined by
“repeated sampling from the same population”,
evidently with no clear realization that the
tion in question is hypothetical, that it could be 
defined in many ways, and that the first to come to 
mind may be quite misleading; or, that an under- 
standing, of what the information is which the test 
is to supply, is needed before an appropriate popu- 
lation, if indeed we must express ourselves in this 
way, can be specified. This particular case will be 
examined more fully in a later section, after illus- 
trating the more general effects of the confusion be- 
tween the level of significance appropriately assigned 
to a specific test, with the frequency of occurrence of 
a specified type of decision. 


2. “Student” 's Test 


In the test of significance due to “Student”
(W. S. Gosset), and generally known as the t-test,
the data are taken to consist of N values of a single
observable variate x, which are to be interpreted on
the hypothesis that they are independent values of
a variate normally distributed, with both mean and
variance unknown, and of which no probability
statements a priori are available. The two statistics

$$\bar{x} = \frac{1}{N}\,S(x), \qquad s^2 = \frac{1}{N-1}\,S(x-\bar{x})^2 \qquad (30)$$

are known to be jointly Sufficient for the estimation
of the true mean, μ, and the variance, σ², and their
sampling distributions in random samples of N can
be expressed exactly in terms of these two
parameters. Thus, as was shown by …, the
mean x̄ has a Normal distribution about the true
mean, μ, with the population variance divided by N.
I.e. the frequency element is

$$\frac{\sqrt{N}}{\sigma\sqrt{2\pi}}\; e^{-N(\bar{x}-\mu)^2/2\sigma^2}\, d\bar{x}, \qquad (31)$$

while the distribution of s², due first I believe to
Helmert (1875) is independent of μ, and of x̄, though
involving σ². If u stand for

$$u = \frac{(N-1)s^2}{2\sigma^2}, \qquad (32)$$

it takes the form

$$\frac{1}{\left(\frac{N-3}{2}\right)!}\; u^{\frac{N-3}{2}}\, e^{-u}\, du, \qquad (33)$$

an Eulerian distribution, identifiable with that of
χ² for N − 1 degrees of freedom, by the equivalence

$$u = \tfrac{1}{2}\chi^2. \qquad (34)$$

On the basis of these distributions “Student” set 
himself to ascertain the exact sampling distribution 
of the ratio between the sampling error of the mean, 
and its standard error as estimated, namely 


$$t = \frac{\sqrt{N}\,(\bar{x}-\mu)}{s} \qquad (35)$$

when x̄ and s are calculated from a finite sample, and
he was successful in establishing the frequency
element

$$\frac{\left(\frac{N-2}{2}\right)!}{\left(\frac{N-3}{2}\right)!\,\sqrt{\pi(N-1)}}\;\frac{dt}{\left(1+\frac{t^2}{N-1}\right)^{N/2}}, \qquad (36)$$

a distribution depending on N only, independent of
both parameters of the Normal distribution sampled,
and which can, therefore, be tabulated so as to give
the value of t exceeded in absolute value, with any
given probability, and for any number, n = (N − 1), of
the degrees of freedom.
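As an illustration, the statistic and its frequency element can be evaluated with the standard library alone. The sketch uses the standard form of “Student” 's density written in terms of the sample size N; the function names and the sample values are hypothetical.

```python
import math
import statistics


def t_statistic(sample, mu):
    """sqrt(N) * (mean - mu) / s, with s estimated on N - 1 degrees of freedom."""
    n = len(sample)
    return math.sqrt(n) * (statistics.mean(sample) - mu) / statistics.stdev(sample)


def student_density(t, n_obs):
    """Frequency element of t per unit dt, for a sample of n_obs observations."""
    coefficient = math.gamma(n_obs / 2) / (
        math.gamma((n_obs - 1) / 2) * math.sqrt(math.pi * (n_obs - 1))
    )
    return coefficient * (1 + t * t / (n_obs - 1)) ** (-n_obs / 2)


# Hypothetical sample of five observations, tested against a mean of 10.
sample = [10.4, 9.8, 11.1, 10.9, 10.3]
t = t_statistic(sample, 10.0)
print(f"t = {t:.4f} on {len(sample) - 1} degrees of freedom")
print(f"density at t: {student_density(t, len(sample)):.5f}")
```

With only two observations the same density reduces to 1/π(1 + t²), the case of Cauchy's distribution.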

It will be recognized that “Student” 's distribution 
allows of induction of the fiducial type, for the 
inequality 


$$\mu < \bar{x} - \frac{1}{\sqrt{N}}\,ts, \qquad (37)$$

will be satisfied with just half the probability for
which t is tabulated, if t is positive, and with the
complement of this value if t is negative. The
reference set for which this probability statement
holds is that of the values of μ, x̄ and s corresponding
to the same sample, for all samples of a given size of
all normal populations. Since x̄ and s are jointly
Sufficient for estimation, and knowledge of μ and σ
a priori is absent, there is no possibility of recognizing
any sub-set of cases, within the general set, for which
any different value of the probability should hold.
The unknown parameter μ has therefore a frequency
distribution a posteriori defined by “Student” 's
distribution.

Although his was the first exact test of significance, 
characteristic of the modern period, “Student” did 
not go so far as to claim that he was introducing a 
new mode of reasoning, and perhaps would have been 
unwilling to believe it had he been told so; for he was
only applying his own good sense to a logical situation
with which he was quite familiar. He was usually
content to leave the inference in the form of a test of
significance, namely that, on the hypothesis, for
example, that the true mean of the population was
zero, the value of t observed, or a greater value,
would have occurred in less than 1% of trials, and was
therefore significant at, for example, the 1% level.
It does appear, however, at one point, that he
certainly was thinking of the unknown mean as
having, in the light of the observations, a definable
frequency distribution, for in the extreme case of only
two observations, x₁ and x₂, when the distribution of
t reduces to Cauchy's distribution,

$$\frac{1}{\pi}\,\frac{dt}{1+t^2}, \qquad (38)$$

and

$$s^2 = \tfrac{1}{2}(x_1 - x_2)^2, \qquad (39)$$

so that

$$t = \frac{2(\bar{x}-\mu)}{|x_1 - x_2|}, \qquad (40)$$

or exactly +1 or −1 if μ takes one or other of the two
observed values, he does point out that these are
the quartiles of Cauchy's distribution, or the points
cutting off a quarter of the area on each side.

Thus by integration of his general formula
“Student” had shown, in this case, that the mean of
the population sampled had exactly the probability
of one half of lying between the two values observed;
this conclusion is notable as illustrating a mode of
inference entirely independent of the form of the
distribution, save that it be continuous, namely that
as each observation has a half chance of being above
and a half chance of being below the median, and as
these chances are independent, it could have been
demonstrated that the median should lie between
the two observations, these being our sole source
of information about its position, with probability
exactly one half. This is a typically fiducial argument,
which would have been vitiated by the existence
of information a priori. “Student”, indeed,
guards himself against this possibility by stipulating
that his sample “itself affords the only indication of
the variability”. I take this to make clear also that
it is not one of an objective series of similar samples 
from the same population existing in reality, though 
it can be regarded by an act of imagination as one of 
a hypothetical reference set. 
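Both statements lend themselves to a quick numerical check: the points t = ±1 are the quartiles of Cauchy's distribution, and two independent observations from any continuous distribution enclose its median with frequency one half. The sketch below is a simulation under assumed distributions (normal and exponential); the function names, seed, and number of trials are arbitrary choices made for illustration.

```python
import math
import random


def cauchy_cdf(t):
    """Proportion of Cauchy's distribution lying below t."""
    return 0.5 + math.atan(t) / math.pi


def coverage_by_two_observations(draw, median, trials, seed=0):
    """Fraction of trials in which two independent draws fall on opposite sides of the median."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x1, x2 = draw(rng), draw(rng)
        if min(x1, x2) < median < max(x1, x2):
            hits += 1
    return hits / trials


print(cauchy_cdf(-1.0), cauchy_cdf(1.0))  # the two quartile points
print(coverage_by_two_observations(lambda r: r.gauss(0.0, 1.0), 0.0, 100_000))
print(coverage_by_two_observations(lambda r: r.expovariate(1.0), math.log(2.0), 100_000))
```

The exact value is 1 − ¼ − ¼ = ½, since the observations fail to enclose the median only when both fall above it or both below it.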

This case is more complex than that dealing with 
radioactive emission discussed in the last Chapter 
(Section 3.3), for here the simultaneous estimation of 
two parameters is required. It is more simple than 
will generally be the case in statistical work, for in 
this case no characteristic of the sample (i.e. of the 
whole body of observations available) can be found to 
define a subset to which our sample belongs, and which 
might exhibit a different, and more relevant, 
frequency distribution. It is this simplicity that has 
deceived those writers who have considered this one 
alone of the practically useful tests of significance into 
ignoring such subsets, or thinking that when such 
subsets are available their existence can be ignored; 
and that a mere consideration of repeated sampling, 
in one of the many forms this may take, is sufficient 
to specify the level of significance appropriate. 


3. The case of linear regression 


A case which illustrates well how misleading the
advice is to base the calculations on repeated
sampling from the same population, if such advice
were taken literally, is that of data suitable for the
estimation of a coefficient of linear regression.
Suppose we have N pairs of values (x, y) constituting 
the numerical data, and suppose also that it is given 
that, for each value of x, the values of y are normally 
distributed; that the mean y for any value of x is a 
linear function of that variable, 

    E_x(y) = Y_x = \alpha + \beta x ,    (41)

and that the variance of each distribution, though
unknown numerically, is known to be the same for all
values of x; i.e.

    E_x(y - Y_x)^{2} = \sigma^{2} .    (42)

The distribution of x may also be given, with or 
without parameters, but this information is, as will 
be seen, irrelevant. 

At least since the time of Gauss it has been known 
that if A, B, C stand for the sums of squares and 
products 

    S(x-\bar{x})^{2}, \quad S(x-\bar{x})(y-\bar{y}), \quad S(y-\bar{y})^{2} ,    (43)

then the best estimate of β, the slope of the regression
line, is

    b = \frac{B}{A} .    (44)

The best estimate of σ² is

    s^{2} = \frac{1}{N-2}\left(C - \frac{B^{2}}{A}\right) ,    (45)

and, for samples having the same fixed value for A,
the estimates b will be normally distributed about
the true value, β, with sampling variance

    V(b) = \frac{\sigma^{2}}{A} .    (46)

We may therefore identify "Student"'s t with

    t = \frac{(b-\beta_0)\sqrt{A}}{s} ,    (47)

where β₀ is any theoretical value of the coefficient,
such as zero, proposed for comparison. Equally, the
unknown β may be assigned a known frequency
distribution in the light of the observations, based on

    \beta = b + \frac{st}{\sqrt{A}} ,    (48)

where t is distributed in "Student"'s distribution for
(N−2) degrees of freedom.

This simple and well-known form of analysis is not, 
I believe, disputed, save in the interpretation of the 
fiducial inferences. It should be noted none the less
that it does violate the criterion of judging by 
repeated sampling, for in repeated sampling from the 
bivariate distribution of x and y, the value of A 
would vary from sample to sample. The distribution 
of (b—B) would no longer be normal, and, before we 
knew what it was, the distribution of A, which in
turn depends on that of x, would have to be investi- 
gated. Indeed, at an early stage Karl Pearson did 
attempt the problem of the precision of a regression 
coefficient in this way, assuming x to be normally
distributed. The right way had, however, been [...]
distribution, appropriate
for samples of rather small numbers of observations.


To judge of the precision of a given value of b, by
reference to a mixture of samples having different
values of A, and therefore different precisions for the
values of b they supply, is erroneous because these
other samples throw no light on the precision of that 
value which we have observed. If we must think in 
terms of random sampling, it is only that selection of 
random samples which agree exactly with our own in 
respect to the value of

    A = S(x-\bar{x})^{2} ,    (49)

that is relevant to assessing its real precision. Such a
selection might be quite inaccessible in sampling for 
acceptance. Further subdivision, so far as to specify 
the N values of x individually, could have been con- 
sidered, but would be required if, and only if, it made 
any difference. 
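
The quantities of this section are easily computed. The following is a minimal sketch, assuming the usual reading of formulae (43) to (47): A, B and C are the sums of squares and products, b = B/A, s² = (C − B²/A)/(N − 2) and t = (b − β₀)√A/s. The function name `regression_summary` and the four data pairs are invented for illustration only.

```python
import math


def regression_summary(xs, ys, beta0=0.0):
    """Compute A, B, C, the slope b, the estimate s^2 and "Student"'s t."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    A = sum((x - x_bar) ** 2 for x in xs)
    B = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    C = sum((y - y_bar) ** 2 for y in ys)
    b = B / A
    s2 = (C - B * B / A) / (n - 2)
    t = (b - beta0) * math.sqrt(A) / math.sqrt(s2)
    return {"A": A, "B": B, "C": C, "b": b, "s2": s2, "t": t, "df": n - 2}


# Made-up data, chosen only so that the arithmetic is easy to follow.
summary = regression_summary([0, 1, 2, 3], [1, 3, 2, 4])
print(summary)
```

Note that the precision attached to b involves the sample only through A and s², so that the values of x enter as fixed quantities and not as a further source of random variation.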


4. The two-by-two table 

Although in the case of simple regression there has 
not, so far as I know, been any tendency to calculate 
erroneous values through mistaking the logical nature 
of the test, at least since the time of Karl Pearson, 
the very similar and equally fundamental test of 
proportionality in a two-by-two table has on more 
than one occasion become a matter of dispute. The 
data in this case consist of a number of cases doubly 
dichotomized, as people may be classified as male or 
female, or, again as tasters or non-tasters of phenyl- 
thiocarbamide, which a proportion of people cannot 
taste at concentrations which to others taste dis- 
tinctly bitter. The statistical test is intended to find 
out whether the four frequencies observed are in 
proportion, or, in other words, whether the two 
classifications are independent. It will be noticed 
that, as in most tests, what is here to be rejected by a 
significant result is a whole class of hypotheses. 
These will have various values for the expected
marginal frequencies, each, however, having pro-
portional frequencies in the contents.

In this problem, as was pointed out concurrently 
by Yates and Fisher in 1934, a great simplification
is available in that the subset of possible samples
having the same marginal frequencies as that observed
will have, whenever the two classifications are
independent, the same frequency distribution. So, if 
in the records of a controlled experiment we find that 
three treated animals all die, and three control 
animals all survive, the two-by-two table 


                 Died   Survived   Total

    Treated        3        0        3
    Control        0        3        3
    Total          3        3        6

is recognized to be one of a subset of possibilities,
being double dichotomies having the same marginal
totals, represented briefly by

                   3 0    2 1    1 2    0 3
                   0 3    1 2    2 1    3 0

    Frequency       1      9      9      1    ÷ 20

[...] successful out of twenty
possibilities, all equally probable on the view that
the treatment is without effect.
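
The frequencies 1, 9, 9, 1 out of 20 can be recovered by direct enumeration. The sketch below is illustrative only; the function name `tables_with_fixed_margins` and its argument names are assumptions introduced here. For each table having the observed marginal totals it counts the number of ways in which the deaths could be allotted to individual animals.

```python
from math import comb


def tables_with_fixed_margins(treated, control, deaths):
    """List (a, ways) for each table with the given marginal totals.

    a is the number of treated animals that died; ways counts the
    allotments of deaths to individual animals that produce that table.
    """
    rows = []
    for a in range(max(0, deaths - control), min(treated, deaths) + 1):
        rows.append((a, comb(treated, a) * comb(control, deaths - a)))
    return rows


rows = tables_with_fixed_margins(treated=3, control=3, deaths=3)
total = sum(ways for _, ways in rows)
for a, ways in rows:
    print(f"treated deaths = {a}: {ways} of {total}")

# The observed table, with all three treated animals dying, is the last row.
p_observed = rows[-1][1] / total
print(f"frequency of the observed table: {p_observed}")
```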

In other cases of compound hypotheses it does not 
always occur that when all the possibilities to be
excluded are excluded by a single test, the level of
significance is the same as the frequency of erroneous 
exclusion, for all the possible hypotheses to be tested. 
It may then be that when some particular case of the 
compound hypothesis is true, the proportion of 
samples capable of dismissing the whole range of 
hypotheses under test will not be so great as the level 
of significance would suggest. It would indeed be 
unreasonable in general to expect it to be otherwise; 
it is therefore worth calling special attention to the 
exceptional feature of the subset available for 
testing significance in the two-by-two table, that the 
frequency of rejection is the same for all the simple 
hypotheses included. 

On two occasions (Wilson, 1941; Barnard,
1945) distinguished mathematical statisticians have
tried to improve the test by including in the 
enumeration cases in which the marginal totals differ 
from those in the sample observed. On both 
occasions, after discussion and elucidation of the 
logical basis of the test, these attempts have been 
abandoned. The argument was well illustrated by 
Barnard using the case discussed above, wherein, if 
we had to consider repetition of the experiment on 
the assumption that the treated and control animals 
have the same expectation p of dying, the frequency
of what has been observed is exactly

    p^{3} q^{3} .    (50)

For any real values of p and q, adding to unity,
this is a small fraction. Its maximum value,
attained when p = ½, is only 1/64. So, if a repetition
of the experiment were the right criterion, as has 
been very dogmatically asserted, significance at least 
at this level could be claimed. It was on this basis
asserted that the method of Neyman and Pearson,
relying on the formula of repeated sampling, had led
to "a much more powerful test than Fisher's". The
64 cases enumerated in this argument include, how-
ever, not only the 20 having the same marginal totals,
but 44 others belonging to four other subsets, of
which it should be noted (i) that what has been
observed does not belong to any of these four sub-
sets, and (ii) within each of these questionable sub-
sets there is no configuration which could be judged
significant.


[...] therefore do no other than enhance the apparent
significance by inflation of the denominator.
Professor Barnard has since then frankly avowed
that further reflection has led him to the same con-
clusion.

5. Excluding a composite hypothesis 


It was remarked above that very commonly a test
of significance is used to exclude any one of a class of
hypotheses, or, as it is sometimes called, a composite
hypothesis. Some of the examples given have the
exceptional feature that a single test can be found 
appropriate for the purpose, in which, were any of 
this class of hypothesis true, the criterion of rejection 
would be satisfied with the same frequency as the 
appropriate level of significance. The errors due to 
equating these concepts which have been indicated 
so far have been only of the kind which flow from 
ignoring the appropriate subset of cases to which the 
observed sample belongs, and seeking to ascertain the 
frequency of occurrences in a more inclusive set con- 
taining elements of a different kind, the variations of 
which are irrelevant to the observed case under 
consideration. 

Composite hypotheses in general, however, contain 
another reason for ignoring the assumption that the 
frequency of rejection should be equated to the level 
of significance; which reason flows from the very fact 
that they are composite, i.e. that two or more distinct 
possibilities are to be rejected, each on sufficiently 
strong evidence. It may be that samples of the kinds 
available do not so easily dismiss the whole range of 
hypotheses to be tested even at a moderate level of 
significance. 

A simple, though artificial, example is the following. 
A number of cards is made up into a pack, in which 
the proportions of the four suits are unknown. The 
composite hypothesis to be tested is that the pro- 
portions of the two red suits do not both exceed 25%. 
The data on which the test is to be based are a sample 
of 100 random draws each followed by replacement
and reshuffling.
The possibility that there were no more than 25%
Hearts could be excluded at a reasonable level of
significance, if 34 Hearts appeared in 100 chosen. If
the true proportion were 25%, the probability of
observing 34 or more is found to be 2.759%. How-
ever, the composite hypothesis is disproved only if it
is demonstrable that the proportion of Diamonds also
is more than 25%, and this requires that the sample
should contain at least 34 Diamonds. It is not
difficult to anticipate that both these conditions
together will be fulfilled very rarely, even in the case
in which both Hearts and Diamonds contribute a full
quarter to the material sampled. In fact, an apparent
disproof of the composite hypothesis at the moderate [...]

It is, of course, no inconvenience that the frequency
of rejecting the hypothesis in some cases when it is
[...] lation indicates also [...] y occupy somewhat
more than 25% of the material sampled, it would not
[...] level of significance, if neither of them were greatly
in excess. Sufficiently large samples could indeed
[...] a demonstration probable; but the [...]
untenable. The [...]
is not to be measured by the frequency observed in
repeated sampling from the same population" have
not, on the whole, been well received by the authors
of this formula. E. S. Pearson (Biometrika, 1947)
quotes me as writing (Sankhyā, 1945):

In recent times one often repeated exposition of the tests of
significance, by J. Neyman, a writer not closely associated 
with the development of these tests, seems liable to lead 
mathematical readers astray, through laying down axiomati- 
cally, what is not agreed or generally true, that the level of 
significance must be equal to the frequency with which the 
hypothesis is rejected in repeated sampling of any fixed 
population allowed by hypothesis. This intrusive axiom, 
which is foreign to the reasoning on which the tests of 
significance were in fact based seems to be a real bar to 
progress. ... 

On this E. S. Pearson remarks, “But the subject 
of criticism seems to me less an intrusive mathe- 
matical axiom than a mathematical formulation of a 
practical requirement which statisticians of many 
schools of thought have deliberately advanced”. It 
would, however, be difficult to find "schools of 
thought" other than that of Neyman and Pearson 
themselves, which have deliberately advanced any- 
thing of the kind. The rather wooden attitude 
adopted by this school seems to stem only from their 
having committed themselves to an unrealistic 
formalism. 

Obvious as it might seem, it is evidently necessary 
to point out that it is no remedy to construct a test 
of significance with a firm intention that the hypo- 
thesis shall be rejected when true in a fixed proportion 
of trials. For this may well be mathematically 
impossible, for the whole range of cases; and con-
sequently, what is of much greater importance, a test
which is made to conform in one case may go widely
astray in others. For example, if the test were
chosen in the example discussed above, that the sum
of the numbers of the two red suits observed in a
sample of 100 should exceed 60 (which, if both red
suits contributed just 25%, would reject the hypo-
thesis to be tested with a "satisfactory" frequency,
2.623%), it would reject it very much more frequently
in other cases, in which the null hypothesis was true, 
as for example if Hearts were 50% and Diamonds only 
25%. Such a “test” while purporting to examine the 
truth of the null hypothesis would in fact reject it 
readily without the evidence of the observations being 
appreciably adverse. At least one of the Tables
published by Professor E. Pearson and H. O. Hartley
is indeed misleading in just this way (No. 11). The
editors' justification is: “It is the result of an attempt
to produce a test which does satisfy the condition that
the probability of the rejection of the hypothesis
tested will be equal to a specified figure". The
logical basis of a test of significance as a means of 
learning from experimental data is here completely 
overlooked; for the potential user has no warning 
of the concealed laxity of the test. 
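
The laxity of such a rule is easily exhibited. The sketch below is illustrative only: the function name is invented, and the reading of "exceed 60" as strictly more than 60 red cards is an assumption, so that the boundary figure it prints need not agree with the percentage quoted in the text. It compares the frequency of rejection in two cases, both of which satisfy the composite hypothesis under test.

```python
from math import comb


def prob_red_exceeds(n, p_red, threshold):
    """P(number of red cards > threshold) when each draw is red with chance p_red."""
    return sum(
        comb(n, i) * p_red**i * (1 - p_red) ** (n - i)
        for i in range(threshold + 1, n + 1)
    )


# In both cases Diamonds do not exceed 25%, so the hypothesis tested is true.
boundary = prob_red_exceeds(100, 0.25 + 0.25, 60)  # Hearts 25%, Diamonds 25%
lopsided = prob_red_exceeds(100, 0.50 + 0.25, 60)  # Hearts 50%, Diamonds 25%
print(f"rejection frequency, 25% + 25%: {boundary:.4f}")
print(f"rejection frequency, 50% + 25%: {lopsided:.4f}")
```

In the second case the rule rejects a true hypothesis almost invariably, although the observations afford no evidence against it.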

In fact, as a matter of principle, the infrequency
with which, in particular circumstances, decisive
evidence is obtained, should not be confused with the
force, or cogency, of such evidence.


6. Behrens' Test 


The immediate effect of “Student”’s work, as its
use came to be appreciated, was to supply a test of
significance for the deviation, from some theoretically
expected value, of the mean of a sample of observa-
tions, or of a regression, even though based on a small
sample. It was quickly shown that the same Table
could be used for the comparison of means, or 
regressions, based on small samples, provided the two 
sets of observations to be compared had the same 
precision, or to take a rarer but more general case, 
precisions in a known ratio. 

These early results covered immediate practical 
requirements rather fully, for with large samples 
capable of supplying, on internal evidence, accurate 
estimates of precision, the large-sample procedure of 
estimating the two sources of error independently 
could be relied on; and, in a great deal of practical 
experimental work with small samples, although 
different lots of material might in reality have 
somewhat unequal variances, there were good reasons 
for supposing the real differences to be small com- 
pared with the errors of estimation from the small 
samples individually; so that better comparisons 
would be obtainable by pooling the variances of the 
different lots. The mathematical problem of the
comparison of means of samples, not only small in
size, but for which there is no reason a priori to
dismiss the largest imaginable differences in precision,
was of mathematical interest, and potential experi-
mental importance, though it is difficult to find
realistic data which present this problem.

For samples from a single population, the effect of
eliminating the unknown variance, σ², by "Student"'s
method, on the distribution of the error of the mean,
is to replace, in the specification of this error,

    \frac{\sigma x}{\sqrt{N}} ,    (51)

where x is normally distributed with unit variance,
but σ is unknown, by

    \frac{s t}{\sqrt{N}} ,    (52)

where t is distributed in “Student”'s distribution, for
the appropriate number of degrees of freedom,
n (= N − 1), and s is the estimate of σ available from
n degrees of freedom.

For two samples from populations having a
common mean, the deviations will be independent,
and the data will supply values s₁, based on n₁
degrees of freedom, and s₂ based on n₂. The difference
between the observed means is then the sum (or
difference) of the two deviations from the true mean,
so that on the null hypothesis considered, namely
that the two population means are equal, we have

    \bar{x}_1 - \bar{x}_2 = \frac{s_1 t_1}{\sqrt{N_1}} - \frac{s_2 t_2}{\sqrt{N_2}} ,    (53)

where t₁ and t₂ are distributed independently in the
two “Student” distributions.

If the frequency is small, such as 1%, that the
[...]ight, which has a known dis[...]
notation, was given by W.-U. Behrens in 1929;
[...] also gave a short numerical table. A paper
of m[...] and somewhat extended [...]
8, 12, 24 and ∞, and values
[...] (= tan θ) for values of θ of 0°, 15°,
30°, 45°, 60°, 75° and 90°. Apart from very small
samples this was sufficient to make the test easily 
available. In 1941 the author gave asymptotic
expansions for calculating probabilities with accuracy
in any particular case, and a further range of tables
for the case when either n₁ or n₂ is large. The
numerical values seem to make allowance nicely for
the fact that a composite hypothesis, in which all
ratios σ₁/σ₂ are possible, is being tested, for it is
required to set a limit which will rarely be passed by
random samples of populations having the same
mean, whatever may be the true variance ratio
(Statistical Tables V₁ and V₂).

In the extreme case in which both samples are of
only two readings such as x₁ and x₂ for the first
sample, and y₁ and y₂ for the second, Behrens' test
takes the simple form of calculating

    T = \frac{x_1 + x_2 - y_1 - y_2}{|x_1 - x_2| + |y_1 - y_2|} ,    (54)

where the level of significance is determined by
giving T “Student”'s distribution for one degree of
freedom, so that the level of significance is

    \frac{2}{\pi} \tan^{-1}(1/T) .    (55)
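
The calculation for two pairs of readings may be sketched as follows. The sketch takes T as the difference of the two sums divided by the sum of the two absolute differences within samples, and the level of significance as (2/π) tan⁻¹(1/T); the function name and the four readings are invented for illustration.

```python
import math


def behrens_two_by_two(x, y):
    """Return (T, level) for two samples of two readings each."""
    x1, x2 = x
    y1, y2 = y
    T = (x1 + x2 - y1 - y2) / (abs(x1 - x2) + abs(y1 - y2))
    # A zero value of T gives no evidence of a difference whatever.
    level = 1.0 if T == 0 else (2.0 / math.pi) * math.atan(1.0 / abs(T))
    return T, level


# Made-up readings, for illustration only.
T, level = behrens_two_by_two((10.0, 14.0), (5.0, 6.0))
print(f"T = {T:.3f}, level of significance = {level:.4f}")

# Check against the 5% point of t for one degree of freedom, tan 85 deg 30'.
five_percent_point = math.tan(math.radians(85.5))
level_at_point = (2.0 / math.pi) * math.atan(1.0 / five_percent_point)
print(f"{five_percent_point:.4f} -> {level_at_point:.4f}")
```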


This extreme case, academic as it is, is particularly
suitable for exhibiting the logic of the test. In 1937
I gave the frequency distribution of T in repeated
samples from populations having a fixed variance ratio,

    \phi = \sigma_2^{2} / \sigma_1^{2} ,    (56)

in the exact form,

    df = \frac{dT}{\pi(1+T^{2})} \left\{ \frac{\sqrt{\phi}}{\sqrt{1+\phi+T^{2}}} + \frac{1}{\sqrt{1+\phi+\phi T^{2}}} \right\} ,    (57)

with the probability integral,

    \frac{2}{\pi} \left\{ \tan^{-1}\frac{T\sqrt{\phi}}{\sqrt{1+\phi+T^{2}}} + \tan^{-1}\frac{T}{\sqrt{1+\phi+\phi T^{2}}} \right\} ,    (58)

from which it is evident that, if we take the 5% value
of |T| as

    \tan 85^{\circ} 30' = 12.7062 ,

this value will be exceeded in 5% of such repeated
trials only at the limits φ = 0, or ∞, while midway
between these, if φ were equal to unity, the criterion
would be exceeded in less than 1% of random trials.
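
These frequencies can be examined numerically. The sketch below rests on an assumption, namely that the probability integral for a fixed variance ratio φ has the form (2/π){tan⁻¹[T√φ/√(1+φ+T²)] + tan⁻¹[T/√(1+φ+φT²)]}, a reading reconstructed here from the limiting values stated in the text. It therefore adds an independent check by simulated sampling at φ = 1. All names are illustrative.

```python
import math
import random


def exceedance_frequency(T, phi):
    """Chance that |T| is exceeded in repeated sampling with variance ratio phi."""
    root = math.sqrt(phi)
    inside = math.atan(T * root / math.sqrt(1 + phi + T * T)) + math.atan(
        T / math.sqrt(1 + phi + phi * T * T)
    )
    return 1.0 - (2.0 / math.pi) * inside


def simulated_frequency(T, phi, trials=200_000, seed=7):
    """Monte Carlo check: two pairs of normal readings with a common mean."""
    rng = random.Random(seed)
    sd2 = math.sqrt(phi)
    hits = 0
    for _ in range(trials):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        y1, y2 = rng.gauss(0, sd2), rng.gauss(0, sd2)
        stat = (x1 + x2 - y1 - y2) / (abs(x1 - x2) + abs(y1 - y2))
        hits += abs(stat) > T
    return hits / trials


criterion = math.tan(math.radians(85.5))  # 12.7062, the 5% value used in the text
for phi in (0.0, 0.01, 1.0, 100.0):
    print(f"phi = {phi:6.2f}: {exceedance_frequency(criterion, phi):.4f}")

simulated = simulated_frequency(criterion, 1.0)
print(f"simulated, phi = 1: {simulated:.4f}")
```

The frequency is 5% only at the limit φ = 0 and falls to about 0.4% at φ = 1, in agreement with the statement that it is there less than 1%.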

This circumstance, indeed, caused me no surprise,
for the reference set in (58) has not been limited to the
subset having the ratio s₁/s₂ observed, but was eagerly
seized upon by M. S. Bartlett, as though it were a defect
in the test of significance of the composite hypothesis,
that in special cases the criterion of rejection is less
frequently attained by chance than in others. On
reflexion I do not think one should expect anything
else, and it was perhaps only because at this time
Bartlett was confidently putting forward an alternative
attempt to solve the same problem that he made so
much of a circumstance which is, indeed, generally to
be expected.


7. The “randomization test”


How important it seemed to Bartlett that, what- 
ever the true nature of the population sampled, the 
null hypothesis should be rejected when true with 
exactly the frequency suggested by the level of
significance, is shown by the fact that he did for a 
time put forward as an alternative, presumably 
thought to be better than Behrens’, a test involving a 
deliberately introduced element of hazard.

Observing the formal resemblance of the problem
discussed by Behrens, in its extreme form when both
samples are of two values only, to a paired comparison
of “Student”’s type with only two pairs, it is easy to
see that if in addition to Behrens’ value T, we
calculate also

    T' = \frac{x_1 + x_2 - y_1 - y_2}{\bigl|\,|x_1 - x_2| - |y_1 - y_2|\,\bigr|} ,    (59)

then, if one of the values T, T′ is chosen by an equal
chance, the one chosen will exceed an appropriate
criterion such as 12.7, in absolute value with a
frequency equal to the level of significance adopted;
and this for all possible variance ratios of the popula-
tion sampled.
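
The property claimed for the coin-chosen value can be checked by simulated sampling. This is a sketch only: it assumes that T′ has the same numerator as T and, for denominator, the absolute difference of the two within-sample differences; the function name, the variance ratios tried and the seed are illustrative.

```python
import math
import random


def randomized_rejection_rate(phi, criterion, trials=200_000, seed=11):
    """Frequency with which a coin-chosen member of {T, T'} exceeds the criterion."""
    rng = random.Random(seed)
    sd2 = math.sqrt(phi)
    hits = 0
    for _ in range(trials):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        y1, y2 = rng.gauss(0, sd2), rng.gauss(0, sd2)
        numerator = x1 + x2 - y1 - y2
        dx, dy = abs(x1 - x2), abs(y1 - y2)
        T = numerator / (dx + dy)            # Behrens' value
        T_prime = numerator / abs(dx - dy)   # the alternative value
        chosen = T if rng.random() < 0.5 else T_prime
        hits += abs(chosen) > criterion
    return hits / trials


criterion = math.tan(math.radians(85.5))  # 5% point of t for one degree of freedom
rates = {phi: randomized_rejection_rate(phi, criterion) for phi in (0.04, 1.0, 25.0)}
for phi, rate in rates.items():
    print(f"variance ratio {phi:5.2f}: {rate:.4f}")
```

The rate stays near 5% for every variance ratio, but only because the throw of the coin has been counted among the observations.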
This proposal, which has perhaps now been
abandoned (though at the time an equally faulty
proposal was quickly put forward by Neyman) has
two conspicuous objections, one of which is of
general importance in that it applies to all “random-
ization tests” in the Natural Sciences. Namely, that
if T and T′ lie on opposite sides of the criterion, T′
being always the larger, and a coin is thrown to
decide which shall be chosen, it is then obvious at
the time that the judgement of significance has been
decided not by the evidence of the sample, but by
the throw of the coin. It is not obvious how the
research worker is to be made to forget this circum-
stance; and it is certain that he ought not to forget it,
if he is concerned to assess the weight only of objective
observational facts against the hypothesis in question.
A real experimenter, in fact, so far from being
willing to introduce an element of chance into the
formation of his scientific conclusions, has been
steadily exerting himself, in the planning of his
experiments, and in their execution, to decrease or to
eliminate all the causes of fortuitous variation which
might obscure the evidence.

Consequently, whereas in the “Theory of Games”
a deliberately randomized decision (1934) may
often be useful to give an unpredictable element to
the strategy of play; and whereas planned random-
ization (1935-1953) is widely recognized as essential
in the selection and allocation of experimental 
material, it has no useful part to play in the formation 
of opinion, and consequently in the tests of signifi- 
cance designed to aid the formation of opinion in the 
Natural Sciences. 

The second and specific objection to Bartlett's T′,
as a test of significance, is that it does not increase or
decrease monotonically for changes in the weight of
the evidence. For example, if y₁ − y₂ is less than
x₁ − x₂, then an equal change in y₁ and y₂, taking
them farther apart, will diminish the denominator
of T′, and actually increase its value; so indicating a
higher level of significance, due to a greater dis-
crepancy between two parallel observations.
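
A small numerical case shows the behaviour objected to. The sketch assumes the same forms of T and T′ as above, T dividing by the sum and T′ by the absolute difference of the two within-sample differences; the readings are invented, and the two y samples have the same mean.

```python
def t_and_t_prime(x, y):
    """Return Behrens' T and the alternative T' for two pairs of readings."""
    numerator = x[0] + x[1] - y[0] - y[1]
    dx, dy = abs(x[0] - x[1]), abs(y[0] - y[1])
    return numerator / (dx + dy), numerator / abs(dx - dy)


x = (10.0, 14.0)
close_pair = t_and_t_prime(x, (5.0, 6.0))   # y readings differ by 1
spread_pair = t_and_t_prime(x, (4.0, 7.0))  # same mean, y readings differ by 3

print("y = (5, 6):  T = %.3f, T' = %.3f" % close_pair)
print("y = (4, 7):  T = %.3f, T' = %.3f" % spread_pair)
```

Taking the two y readings farther apart lowers T, as a weakening of the evidence should, but raises T′.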

In fact a practical worker who had calculated T
and T′, could only regard them as providing evidence
of significance, if both exceeded the minimum level,
and since T′ is never less than T, this implies simply
the use of Behrens' test.


It is an indication of [...]
randomization test an[...]
of the n₁ [...]
equal number from the other sample, as can be done
in no less than

    \frac{n_1!}{(n_1 - n_2)!}    (60)

different ways, was put forward by J. Neyman as a
general solution of the problem. That "solution"
also has never, I believe, been applied in practice.


8. Qualitative differences 

The examples elaborated in the foregoing sections
of numerical discrepancies arising from the rigid
formulation of a rule, which at first acquaintance it
seemed natural to apply to all tests of significance,
constitute only one aspect of the deep-seated
difference in point of view which arises when Tests
of Significance are reinterpreted on the analogy of
Acceptance Decisions. It is indeed not only numeric-
ally erroneous conclusions, serious as these are, that
are to be feared from an uncritical acceptance of this
analogy.

An important difference is that Decisions are final,
while the state of opinion derived from a test of
significance is provisional, and capable, not only of
confirmation, but of revision. An acceptance pro-
cedure is devised for a whole class of cases. No
particular thought is given to each case as it arises,
nor is the tester’s capacity for learning exercised. A
test of significance on the other hand is intended to
aid the process of learning by observational experi-
ence. In what it has to teach each case is unique,
though we may judge that our information needs
supplementing by further observations of the same,
or of a different kind. To regard the test as one of a
series is artificial; the examples given have shown how
far this unrealistic attitude is capable of deflecting
attention from the vital matter of the weight of the
evidence actually supplied by the observations on
some possible theoretical view, to, what is really
irrelevant, the frequency of events in an endless series
of repeated trials which will never take place.

The concept that the scientific worker can regard 
himself as an inert item in a vast co-operative 
concern working according to accepted rules, is 
encouraged by directing attention away from his duty 
to form correct scientific conclusions, to summarize 
them and to communicate them to his scientific 
colleagues, and by stressing his supposed duty 
mechanically to make a succession of automatic
"decisions", deriving spurious authority from the
very incomplete mathematics of the Theory of
Decision Functions.

[...] t this responsibility can
be delegated to a giant computer programmed with
[...] however, really been advanced (Neyman, 1938) that
Inductive Reasoning does not exist, but only
"Inductive Behaviour"!


A misconception having some troublesome con-
sequences was introduced by Neyman and Pearson
in 1933, shortly after they had learnt of the possibility
of deriving probability statements and therefore
limits of significance by the fiducial argument, which
had been published in the same journal, the Proceed-
ings of the Cambridge Philosophical Society, in 1930.
Instead of perceiving that my method was appropriate 
to the absence of knowledge a priori, and, although I 
had not made this clear, would have been invalidated 
by the presence of such knowledge, Neyman and 
Pearson speak of my results as though they were a 
kind of “greatest common measure" of the inferences 
which could be made for all possible types of informa- 
tion a priori. In fact their paper opens as follows: 
In a recent paper we have discussed certain general
principles underlying the determination of the most efficient
tests of statistical hypothesis, but the method of approach
did not involve any detailed consideration of the question of
a priori probability. We propose now to consider more fully
the bearing of the earlier results on this question and in
particular to discuss what statements of value to the
statistician in reaching his final judgment can be made from
an analysis of observed data, which would not be modified
by any change in the probabilities a priori. In dealing with
the problem of statistical estimation,* R. A. Fisher has shown
how, under certain conditions, what may be described as
rules of behaviour can be employed which will lead to
results independent of these probabilities; in this connection
he has discussed the important conception of what he terms
fiducial limits. But the testing of statistical hypotheses
cannot be treated as a problem in estimation, and it is
necessary to discuss afresh in what sense tests can be
employed which are independent of a priori probability laws.


* My paper had, however, been entitled “Inverse Probability.”

This early misconception has led other writers to
seek for inferences independent of a priori laws,
whereas seeing that Bayes’ theorem is based upon 
supposedly exact knowledge of probabilities a priori, 
and that these probabilities can be made to appear 
explicitly in the result, none but trivial conclusions 
can be common to all cases. It is perhaps some sort 
of recognition of this that makes these authors 
ascribe to me “rules of behaviour", which I had not 
mentioned at all, whereas I had written in quite con- 
ventional terms which refer to reasoning processes, 
such as "learning by experience” and the “probability 
of causes". The logical distinction must in any case 
be stressed between possessing no information of 
a certain kind, and possessing such information,
although it may be provisionally expressed in a 
generalized notation. The confusion of these situa- 
tions is a serious trap, especially for mathematicians 
without experience in the Sciences. 

It is important that the scientific worker introduces 
no cost functions for faulty decisions, as it is reason- 
able and often necessary to do with an Acceptance 
Procedure. To do so would imply that the purposes 
to which new knowledge was to be put were known 
and capable of evaluation. If, however, scientific 
findings are communicated for the enlightenment of 
other free minds, they may be put sooner or later to 
the service of a number of purposes, of which we can 
know nothing. The contribution to the improvement
[...]
variety of purposes by a great variety of men, and
groups of men, will be facilitated. No one, happily, is
in a position to censor these in advance. As workers
in Science we aim, in fact, at methods of inference
which shall be equally convincing to all freely
reasoning minds, entirely independently of any
intentions that might be furthered by utilizing the
knowledge inferred.


1. W.-U. Behrens (1929).
2. F. R. Helmert (1875).
3. K. Pearson (1900).
4. K. Pearson (1925).
5. "Student" (1908).
6. R. A. Fisher (1924).
7. E. B. Wilson (1941).
8. G. A. Barnard (1945).
9. E. B. Wilson (1942).


CHAPTER V


SOME SIMPLE EXAMPLES OF 
INFERENCES INVOLVING PROBABILITY 
AND LIKELIHOOD 


1. The logical consequences of uncertainty 


The concepts sketched in Chapter III have arisen in 
the study of numerical observations in the Natural 
Sciences; they are intended for use in the inferences 
by which progress in the sciences is guided. Since 
the reasoning is quantitative it involves mathemat- 
ical operations, which need not, however, be of a 
very complicated kind. Indeed, the examples may 
be confined to simple cases, though often at the 
expense of being scientifically trivial, for it is not 
the mathematics but the logical nature of these 
concepts that requires to be exemplified. Since the 
reasoning is inductive, the development for it of 
appropriate mathematical operations seems to run 
counter to the view that all mathematics can be 
reduced to a single and wholly deductive system. 
Admittedly deductive processes play a predominant 
part in mathematics, yet it is difficult to admit that 
mathematics is less than the whole art of exact 
quantitative reasoning, and as such must extend 
beyond the domain of deduction proper. 

The theory that all mathematics could be reduced 
to a purely deductive system, which was popular 
about the beginning of this century, has, moreover, 
in the meantime suffered, with the development of 
axiomatic studies, some rather severe setbacks. It is
common ground that the consistency of the axiomatic
basis of a deductive system is essential for the
reliability of its consequences. It has been formally 
demonstrated that a system admitting one contra- 
diction must admit all, in the sense that any proposi- 
tion whatever can be deduced from it, by formally 
rigorous processes. The non-existence of contra- 
dictory consequences is thus a burning question for 
the whole superstructure. Moreover, it has been 
proved that the non-existence of such contradictions 
can never be demonstrated on the basis of the axioms 
of the system themselves. It would be rather 
absurd, indeed, to imagine that any chain of theorems, 
derived from a given axiomatic basis, could disprove 
a possible property of that basis, when it is known 
that, if it had that property, these same theorems 
could certainly be deduced from it. For the 
possibility of proving such theorems does not depend 
upon the truth of what they assert. It would seem, 
therefore, that the validity of a purely deductive 
system has at best the same logical status as has a 
scientific theory, which has not yet been found in any 
case to be in conflict with the observations. As such 
it appears to be solidly based on a well-tested 
induction. 

The axiomatic theory of mathematics has not been 
taken very seriously in those branches of the subject 
in which applications to real situations are in view. 
For, in applied mathematics, it is unavoidable that 
new concepts should from time to time be introduced 
as the cognate science develops, and any new defini- 
tion having axiomatic implications is inevitably a 
threat to the internal consistency of the whole 
system of axioms into which it is to be incorporated. 
We have seen that the introduction of the concept
of probability has caused just such an axiomatic
disturbance. In its applications, therefore, mathematics
cannot easily be reduced to a closed and static
system, but has to develop with the development of
human thought, of which it is an important vehicle. 

The purpose of deductive processes is to reveal, 
or uncover, the latent consequences of the axiomatic 
basis adopted. Nothing essentially new can be dis- 
covered, but the coherence of the whole structure 
can be usefully demonstrated, and its consistency to 
some extent tested. In the future these processes will 
perhaps be carried out better by machines than by 
men. The axiomatic basis, in any case, is tailored 
with a view to its deductive consequences, and it is 
this which gives it its real utility. Deductive argu- 
ments are, in fact, often only stages in an inductive 
process. For example, Bayes’ theorem, on the data 
postulated, is strictly deductive, nevertheless we may 
include it among the processes of induction, on the 
ground that the probability statement a priori on 
which the argument is founded must in the real 
world have an inductive, or factual, rather than an 
axiomatic, basis. 

On the contrary, the purpose of inductive reason- 
ing, based on empirical observations, is to improve 
our understanding of the systems from which these
observations are drawn. The appropriate mathematical
forms for reasoning of this type have been
becoming clear during the present century owing to
the widespread application of statistical methods to
scientific data, and to increasing understanding of
the principles of the design of experiments. One of
the obstacles which has had to be overcome is the
tendency to impose on inductive thought the conventions
and preconceptions appropriate only to
deductive reasoning. 

The governing characteristic of inductive reason- 
ing is that it is always used to arrive at statements of 
uncertainty, and that logical situations are recogniz- 
able in which different types or degrees of un- 
certainty require to find rigorous expression. It has 
been thought that the Theory of Mathematical 
Probability, in spite of the fact that a probability 
statement is in reality a statement of a specific type
of uncertainty, could be included among strictly 
deductive processes. This has seemed possible largely 
because many mathematical treatises have adopted 
a formal and abstract treatment in which the element 
of uncertainty is inoperative, just because applica- 
tions to the real world are avoided. 

The logical characteristic, which has been too 
much overlooked, of all inferences involving un- 
certainty is that the rigorous specification of the 
nature and extent of the uncertainty by which they 
are qualified must in general involve the whole of the 
data, quantitative and qualitative, on which they
are based. 

As soon as it is regarded realistically it is seen that 
the concept of Mathematical Probability shares this 
requirement. In a statement of probability the 
predicand, which may be conceived as an object, as 
an event, or as a proposition, is asserted to be one of a 
set of a number, however large, of like entities of 
which a known proportion, P, have some relevant 
characteristic, not possessed by the remainder. It is 
further asserted that no subset of the entire set, 
having a different proportion, can be recognized. 
If, therefore, any portion of the data were to allow
of the recognition of such a subset, to which the 
predicand belongs, a different probability would be 
asserted using the smallest such subset recognizable. 

When no further subset is recognizable, which 
can be known only by an exhaustive scrutiny of the 
data, the predicand is spoken of as a random member 
of the ultimate set to which it belongs. An imagined 
process of sampling in which a succession of pre- 
dicands are identified may be used to illustrate the 
relation between the proportion expected to be 
observed in the sample, and the primary propor- 
tion required to specify the set, now to be identified 
with the population sampled. Rather unsatisfactory 
attempts have been made to define the probability 
by reference to the supposed limit of such a random 
sampling process. 

Difficulty has sometimes been expressed when the 
reference set, or the population sampled, is said to be 
infinite. The definition and consequent calculations 
can, however, be applied to any finite set however 
large, and the limit of these results, where the 
number in the set is increased indefinitely, is all 
that is meant by the results of sampling from an 
infinite population. The clarity of the subject has 
suffered from attempts to conceive of the "limit" 
of some physical process to be repeated indefinitely
in time, instead of the ordinary mathematical limit 
of an expression of which some element is to be made 
increasingly great. 
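
The remark that "sampling from an infinite population" means no more than the ordinary mathematical limit of finite calculations can be illustrated numerically. The following is a minimal sketch, not taken from the text: the function names, the proportion P = 0.3 and the sample of ten are all assumptions chosen for illustration. It compares the probability of three marked members in a sample drawn from a finite set of M entities with the limiting (binomial) value as M is increased.

```python
from math import comb

def finite_population_prob(M, K, n, k):
    """Probability of k marked members in a sample of n drawn without
    replacement from a finite set of M entities, K of which are marked."""
    return comb(K, k) * comb(M - K, n - k) / comb(M, n)

def limiting_prob(P, n, k):
    """Binomial value: the mathematical limit as M grows with K/M = P fixed."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

n, k, P = 10, 3, 0.3          # illustrative figures only
limit = limiting_prob(P, n, k)
gaps = []
for M in (100, 1_000, 10_000, 100_000):
    K = int(round(P * M))     # number of marked entities in the finite set
    value = finite_population_prob(M, K, n, k)
    gaps.append(abs(value - limit))
    print(M, value, limit)
```

Nothing in the calculation refers to a physical process repeated in time; only the number in the set is made increasingly great.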

The following sections are intended to illustrate
the kinds of reasoning, and concurrent mathematical
operations, appropriate to various types of uncertainty.
2. Bayesian prediction

Expressed in terms of the hypothetical ratio p : q,
Bayes' classical inference (from the uniform distribution
a priori he assumed) is that the probability
distribution of p is exactly

$$\frac{(a+b+1)!}{a!\,b!}\,p^a q^b\,dp, \qquad (61)$$

in the light of the empirical observation of a successes
out of (a+b) trials. Alternatively, the hypothetical
parameter p (or q) can be eliminated, and the
inference expressed wholly in terms of the probability
of future observations.

For example, if c+d further trials were to be made
with the same causal system, the probability, for
each possible value of p, of observing just c successes
is

$$\frac{(c+d)!}{c!\,d!}\,p^c q^d. \qquad (62)$$

The average of this fraction, over all possible
values of p, is then found by integration to be

$$\frac{(a+b+1)!}{a!\,b!}\cdot\frac{(a+c)!\,(b+d)!}{(a+b+c+d+1)!}\cdot\frac{(c+d)!}{c!\,d!}; \qquad (63)$$



this represents the probability, in the light of the
previous experience, of obtaining c successes in
(c+d) further trials, the hypothetical parameters
p, q having been eliminated. It connects future
rational anticipations directly with the experience
on which they are based, without the mediation of
hypothetical quantities. It is postulated only that the
two samples are drawn from the same constant
population of possibilities, and that Bayesian knowledge
a priori is available.
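
Expression (63) can be evaluated exactly in rational arithmetic. The sketch below is illustrative only: the function name `bayes_predictive` and the figures (3 successes in 19 trials, 21 further trials) are assumptions, not part of the original argument. It checks that the predicted probabilities for every possible number of future successes add to unity, and that a single further trial gives the familiar value (a+1)/(a+b+2).

```python
from fractions import Fraction
from math import comb, factorial

def bayes_predictive(a, b, c, d):
    """Expression (63): probability of c successes in c+d further trials,
    after a successes in a+b trials, with a uniform distribution a priori."""
    f = factorial
    return (Fraction(f(a + b + 1), f(a) * f(b))
            * Fraction(f(a + c) * f(b + d), f(a + b + c + d + 1))
            * comb(c + d, c))

a, b, n_future = 3, 16, 21    # illustrative figures only
dist = [bayes_predictive(a, b, c, n_future - c) for c in range(n_future + 1)]

print(sum(dist))                          # total over all possible outcomes
print(bayes_predictive(a, b, 1, 0))       # one further trial, a success
print(Fraction(a + 1, a + b + 2))         # the same value in closed form
```

The parameter p does not appear anywhere in the computation; only the two sets of counts enter.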
It may be noticed that the last factor in the expression
developed above,

$$\frac{(c+d)!}{c!\,d!} \qquad (64)$$

stands only for the binomial coefficients forming the
last line, or base, of Fermat's arithmetical triangle;
but

$$\sum \frac{(c+d)!}{c!\,d!}\,p^c q^d \qquad (65)$$

is not the only polynomial in p, q, the value of which
is constantly equal to unity. If, in fact, the triangle
is extended to any chosen boundary, as for example
in the diagram (Fig. 3), the thirteen totals outside
the boundary are the coefficients $\omega(c, d)$ of a
polynomial

$$\sum \omega(c,d)\,p^c q^d = [\ldots] \qquad (66)$$

of which the value is unity for all values of p.


FIG. 3. THE ARITHMETICAL TRIANGLE
Then, based on previous experience of a successes
out of a+b, we may infer the probability of reaching
the terminal value (c, d) to be

$$\frac{(a+b+1)!}{a!\,b!}\cdot\frac{(a+c)!\,(b+d)!}{(a+b+c+d+1)!}\cdot\omega(c,d), \qquad (67)$$

if a subsequent trial were made with these end-points.
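
The coefficients for a chosen boundary can be obtained by counting the routes through the arithmetical triangle. The sketch below is a hedged illustration: the boundary of Fig. 3 is not recoverable here, so a hypothetical boundary is assumed (sampling stops at 3 successes or at 4 failures, whichever comes first), and the names `terminal_counts`, `total` and `predictive` are invented for the purpose. It checks that the polynomial in p, q is unity, and that the predictive probabilities of the terminal points add to unity.

```python
from fractions import Fraction
from math import factorial

def terminal_counts(stop):
    """Count the routes of the arithmetical triangle arriving at each terminal
    point; stop(c, d) says whether sampling ends on reaching (c, d)."""
    frontier, omega = {(0, 0): 1}, {}
    while frontier:
        nxt = {}
        for (c, d), ways in frontier.items():
            for point in ((c + 1, d), (c, d + 1)):   # a success, or a failure
                target = omega if stop(*point) else nxt
                target[point] = target.get(point, 0) + ways
        frontier = nxt
    return omega

# Hypothetical boundary (NOT that of Fig. 3): stop at 3 successes or 4 failures.
omega = terminal_counts(lambda c, d: c == 3 or d == 4)

def total(p):
    """Value of the polynomial formed with the coefficients omega(c, d)."""
    q = 1 - p
    return sum(w * p**c * q**d for (c, d), w in omega.items())

def predictive(a, b, c, d):
    """Probability of ending at (c, d), after a successes in a+b trials,
    formed as the expression for c successes and d failures with omega(c, d)
    in place of the binomial coefficient."""
    f = factorial
    return (Fraction(f(a + b + 1), f(a) * f(b))
            * Fraction(f(a + c) * f(b + d), f(a + b + c + d + 1))
            * omega[(c, d)])

print(sorted(omega.items()))
print(total(Fraction(1, 3)), total(Fraction(7, 10)))
print(sum(predictive(3, 16, c, d) for (c, d) in omega))
```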


3. Fiducial prediction 

When there are no data a priori and the fiducial 
argument is available, the parametric values may 
equally be eliminated, and the appropriate inferences 
expressed as probability statements about future 
observations. In the case of the radioactive source 
considered in Section 3.3, if the total length of $N_1$
measured time intervals were $X_1$, we previously drew
the inference that

$$\chi^2_{2N_1} = 2\theta X_1, \qquad (68)$$

where $\theta$ is the true rate of emission, and $\chi^2$ is distributed
as is the sum of $2N_1$ independent squares of
normal deviates each having unit variance.

If, now, a second series of $N_2$ time readings were
to give a total time of $X_2$, it follows equally that

$$\chi^2_{2N_2} = 2\theta X_2. \qquad (69)$$

The ratio of $X_2$ to $X_1$ is distributed, therefore, in
random samples, in a distribution independent of $\theta$,
and depending only on $N_1$ and $N_2$. This distribution
is the basis of the analysis of variance. The distribution
of $X_2$ given $X_1$ is, in fact,

$$\frac{(N_1+N_2-1)!}{(N_1-1)!\,(N_2-1)!}\cdot\frac{X_1^{N_1}\,X_2^{N_2-1}\,dX_2}{(X_1+X_2)^{N_1+N_2}}, \qquad (70)$$

for all values of $X_2$ from 0 to ∞. Without discussing
the possible values of the parameter $\theta$, therefore, the
exact probability of the total time recorded in a
second series of trials lying within any assigned
limits is thus calculable on the basis of the total time
observed in the first series.

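The probability of the value $X_2$ to be observed,
exceeding any chosen value x, is the integral of the
frequency element for all values of $X_2$ exceeding x,
and is expressible as the sum of the first $N_2$ terms of
a negative binomial expansion, i.e.

$$\left(\frac{X_1}{X_1+x}\right)^{N_1}\left\{1 + N_1\,\frac{x}{X_1+x} + \frac{N_1(N_1+1)}{2!}\left(\frac{x}{X_1+x}\right)^{2} + \cdots + \frac{(N_1+N_2-2)!}{(N_1-1)!\,(N_2-1)!}\left(\frac{x}{X_1+x}\right)^{N_2-1}\right\}. \qquad (71)$$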
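
The prediction can be put to the kind of test described in the text by simulation. The sketch below rests on stated assumptions: the rate, the series lengths and the seed are arbitrary, `tail_prob` is a name invented here for the negative binomial sum, and the standard-library call `random.gammavariate(N, 1/theta)` is used to generate the total of N exponential intervals. If the prediction is sound, the predicted probability of exceeding the value actually obtained in the second series should itself be uniformly distributed between 0 and 1, whatever the true rate.

```python
import random
from math import comb

def tail_prob(x, X1, N1, N2):
    """Probability that the second total X2 exceeds x, given the first total X1:
    the first N2 terms of the negative binomial expansion."""
    w = X1 / (X1 + x)
    v = x / (X1 + x)
    return w**N1 * sum(comb(N1 + k - 1, k) * v**k for k in range(N2))

random.seed(1)
theta, N1, N2 = 2.5, 4, 3     # illustrative; theta is never passed to tail_prob
u = []
for _ in range(20000):
    X1 = random.gammavariate(N1, 1 / theta)   # total of N1 intervals
    X2 = random.gammavariate(N2, 1 / theta)   # total of N2 further intervals
    u.append(tail_prob(X2, X1, N1, N2))

frac_below_tenth = sum(v < 0.1 for v in u) / len(u)
mean_u = sum(u) / len(u)
print(frac_below_tenth, mean_u)
```

With a single interval in each series the tail probability reduces to $X_1/(X_1+x)$, which gives one half at $x = X_1$.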

It should be observed that such fiducial probability 
statements about future observations are verifiable 
by subsequent observations to any degree of pre- 
cision required. This is not possible for probability 
statements inferred about parametric values, save 
on the supposition that they are capable in 
some other way of direct observation. Probability 
statements about the hypothetical parameters are, 
however, generally simpler in form, and, once their 
equivalence is understood to predictions in the form 
of probability statements about future observations, 
they are seen not to incur any logical vagueness by
reason of the subjects of them being relatively
unobservable.


In carrying out such a verification as that suggested
above, it is to be supposed [...] could at any time
be made. For verification, the original [...]
must be held firmly in view. This, of course, is a
somewhat unnatural attitude for a worker whose
main preoccupation is to improve his ideas. It is
perhaps for this reason that some teachers assert 
that statements of fiducial probability cannot be 
tested by observations. It is also to be noted that 
future events or repetitions of the same event, which 
would be independent for a fixed value of a parameter, 
will not generally be independent when the para- 
meter has a frequency distribution. 


4. Predictions from a Normal Sample 

A case of particular interest of the fiducial 
probability of future observations is offered by the 
process of sampling a normal distribution. From a 
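sample of N observations the two statistics, the
estimated mean

$$\bar{x} = \frac{1}{N}\sum x, \qquad (72)$$

and the estimated variance of the mean,

$$s^2 = \frac{1}{N(N-1)}\sum (x-\bar{x})^2, \qquad (73)$$

subsume the whole of the information supplied by
the sample about the population from which it was
drawn. If $\mu$ is the true mean, the quantity

$$t = \frac{\bar{x}-\mu}{s} \qquad (74)$$

has a distribution independent of the unknown
parameters, well known to be

$$\frac{\left(\tfrac12(N-2)\right)!}{\left(\tfrac12(N-3)\right)!\,\sqrt{\pi(N-1)}}\cdot\frac{dt}{\left(1+\dfrac{t^2}{N-1}\right)^{\frac12 N}} \qquad (75)$$

for N−1 degrees of freedom.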
This distribution has been very adequately tabulated
so that the value of t is known for all levels of
significance ordinarily required; the equation

$$\mu = \bar{x} - st \qquad (76)$$

expresses $\mu$ as a random variable having a distribution
of "Student"'s type, with N−1 degrees of freedom,
and a scale factor s, calculable from the sample
observed by a rigorous fiducial argument, provided
no information a priori is available.

Ignoring, however, the mean, $\mu$, of the hypothetical
population, it would equally have been
possible by a direct fiducial argument to calculate
the distribution, in the light of the N observations
already made, of a further observation, x, so far
unknown. For if it is to be drawn at random from
the same population, x will be normally distributed
with variance N times that of $\bar{x}$, about the same
mean, and independently of it, so that

$$x - \bar{x}$$

has a normal distribution about zero with variance
(N+1) times as great as has $\bar{x}$, and consequently,
when the observed value s is used to eliminate $\sigma$, has
the distribution specified by

$$x - \bar{x} = st\sqrt{N+1}, \qquad (77)$$

where, again, t has N−1 degrees of freedom. This
prediction is evidently capable of verification to any
degree of precision.

The same argument may be applied to predicting
the value of a future sample of N′, or, in particular,
of the statistics $\bar{x}'$ and $s'$ derivable from it. In this
case the value of the population variance could be
estimated from the sum of the squares within both
samples, that is as

$$\frac{1}{N+N'-2}\left\{\sum (x-\bar{x})^2 + \sum (x'-\bar{x}')^2\right\}, \qquad (78)$$

and this quantity multiplied by

$$\frac{1}{N} + \frac{1}{N'} \qquad (79)$$

would provide the estimated variance of the difference
between the observed means

$$\bar{x} - \bar{x}'$$

based on N+N′−2 degrees of freedom. Consequently,
for unselected samples

$$(\bar{x}-\bar{x}')\,\sqrt{\frac{NN'}{N+N'}}\,\sqrt{\frac{N+N'-2}{S+S'}}, \qquad (80)$$

where S and S′ stand for the two sums of squares in (78),
will be distributed as is "Student"'s ratio t for
N+N′−2 degrees of freedom. A second pivotal
relationship is needed, since two values are to be
predicted, and this is supplied by

$$2z = \log\frac{N's'^2}{Ns^2} = \log\frac{(N-1)\,S'}{(N'-1)\,S}, \qquad (81)$$

the logarithm of the ratio of two independent
estimates of the same variance, of which the distribution
depends only on N and N′, and is independent
of the true mean.
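
The prediction of a single further observation, expression (77), is easily put to an empirical check. The following is a sketch under stated assumptions: samples of ten are used, the population mean and standard deviation are arbitrary, and the value 2.262 is the tabulated two-sided 5 per cent point of t for 9 degrees of freedom. The proportion of further observations falling within the predicted limits should then be close to 95 per cent.

```python
import random
from math import sqrt

T_5PC_9DF = 2.262   # tabulated two-sided 5 per cent point of t, 9 degrees of freedom

def within_limits(N, mu, sigma):
    """Draw a sample of N, then one further observation, and report whether
    the latter lies within the limits given by x - xbar = s t sqrt(N + 1)."""
    xs = [random.gauss(mu, sigma) for _ in range(N)]
    xbar = sum(xs) / N
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (N * (N - 1)))  # s.e. of the mean
    x_new = random.gauss(mu, sigma)
    return abs(x_new - xbar) <= T_5PC_9DF * s * sqrt(N + 1)

random.seed(2)
trials = 20000
rate = sum(within_limits(10, mu=5.0, sigma=3.0) for _ in range(trials)) / trials
print(rate)
```

The mean and standard deviation of the population are used only to generate the data; the limits themselves are calculated from the sample alone.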

To avoid misapplication of the method it should be
noticed that for all pairs of primary observations
$\bar{x}$, S, there is a one to one correspondence between
pairs of "pivotal" values t, z, and corresponding
pairs of predicted observations $\bar{x}'$, S′; equally for
any pair of values $\bar{x}'$, S′, a similar correspondence
subsists between the pivotal values and the primary 
observations. Only on this condition the known 
frequency distribution of the pivotal values may be 
projected, or mapped, by direct substitution, to give 
the frequency distribution of the pair of unknowns 
to be predicted. The impossibility of statements of 
fiducial probability from discontinuous cases, such 
as the binomial distribution, is traceable to the fact 
that a single observational value corresponds, with 
any one parametric value, to a whole range in the 
values of the pivotal quantity, expressing the 
"probability integral" of the distribution. 

In 1936,⁴ addressing the Harvard tercentenary
conference, I suggested that the condition for the 
further development of the use of fiducial inferences 
needed mathematical investigation, and would depend 
on the conditions of solubility of a type of problem, 
of which I gave, in general terms, an example, which 
has come to be known as the Problem of the Nile: 


The agricultural land of a pre-dynastic Egyptian 
village is of unequal fertility. Given the height to 
which the Nile will rise, the fertility of every 
portion of it is known with exactitude, but the 
height of the flood affects different parts of the 
territory unequally. It is required to divide the 
area, among the several households of the village, 
so that the yields of the lots assigned to each shall
be in predetermined proportions, whatever may
be the height to which the river rises.


The problem has not, I believe, in the meanwhile
yielded up the conditions of its solubility, upon
which, it would appear, the possibility of fiducial
inference with two or more parameters, must depend.

It should, however, be noted, in the case of the two
parameters of the normal distribution, that the sum
of squares, the statistic S, not only yields a sufficient
estimate for the true variance $\sigma^2$, but has a distribution
independent of the true mean, $\mu$, and so would
supply also a solution of the type demanded in the
Nile problem, if $\mu$ were the unknown variable.

An interesting application of the simultaneous
distribution predicted for the two statistics of a
future sample of N′ values, is to allow N′ to increase
without limit, so that $\bar{x}'$ shall tend "in probability"
to the population mean, $\mu$, and the ratio

$$S'/(N'-1) \qquad (82)$$

to the population variance, $\sigma^2$. We then have the
simultaneous distribution, in the light of the first
sample only, of the two parameters characterizing
the population sampled. The frequency element of this
simultaneous distribution is found to be the product

$$\sqrt{\frac{N}{2\pi\sigma^2}}\;e^{-N(\mu-\bar{x})^2/2\sigma^2}\,d\mu \;\cdot\; \frac{1}{\left(\tfrac12(N-3)\right)!}\left(\frac{S}{2\sigma^2}\right)^{\frac12(N-1)} e^{-S/2\sigma^2}\,\frac{d\sigma^2}{\sigma^2}. \qquad (83)$$

The rigorous step-by-step demonstration of the
bivariate distribution by the fiducial argument
would in fact consist first of the establishment of the
second factor giving the distribution of $\sigma$ given S,
disregarding the other parameter, $\mu$, and then of
finding the first factor as the distribution of $\mu$ given
$\bar{x}$ and $\sigma$. Several writers have adduced instances in
which, when the formal requirements of the fiducial
argument are ignored, the results of the projection
of frequency elements using artificially constructed
pivotal quantities may be inconsistent. When the
fiducial argument itself is applicable, there can be no
such inconsistency.
It will be noticed that in this simultaneous distribution
(83) $\mu$ and $\sigma^2$ are not distributed independently.
Integration with respect to either variable
yields the unconditional distribution of the other,
and these are naturally those obtainable by direct
application of the fiducial argument, namely that

$$\frac{\mu - \bar{x}}{s} \qquad (84)$$

is distributed as is t for (N−1) degrees of freedom,
while

$$\frac{S}{\sigma^2} \qquad (85)$$

is distributed as is $\chi^2$ for (N−1) degrees of freedom.
The distribution of any chosen function of $\mu$ and $\sigma^2$
can equally be obtained. The doubts expressed by 
Bartlett? on this point appear to be quite groundless. 
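
The two-stage construction of the simultaneous distribution lends itself to a numerical check of its margins. The sketch below is illustrative only: the sample statistics (N = 10, mean 5, sum of squares 81) are invented, and `random.gammavariate((N - 1) / 2, 2.0)` is used as a standard way of generating a $\chi^2$ variate with N−1 degrees of freedom. Pairs of parametric values are drawn by taking first the variance from S divided by such a variate, then the mean normally about the sample mean; the proportion of values of the mean lying within 2.262 estimated standard errors of the sample mean should then agree with the 95 per cent given by "Student"'s distribution for 9 degrees of freedom.

```python
import random
from math import sqrt

def draw_fiducial_pair(xbar, S, N):
    """One pair (mu, sigma^2): sigma^2 from S / chi-square on N-1 degrees of
    freedom, then mu normally about xbar with variance sigma^2 / N."""
    chi2 = random.gammavariate((N - 1) / 2, 2.0)   # chi-square, N-1 d.f.
    sigma2 = S / chi2
    mu = random.gauss(xbar, sqrt(sigma2 / N))
    return mu, sigma2

random.seed(3)
N, xbar, S = 10, 5.0, 81.0            # illustrative sample statistics
s = sqrt(S / (N * (N - 1)))           # estimated standard error of the mean
draws = [draw_fiducial_pair(xbar, S, N) for _ in range(20000)]
inside = sum(abs(mu - xbar) <= 2.262 * s for mu, _ in draws) / len(draws)
print(inside)
```

Any chosen function of the two parameters can be studied in the same way, by evaluating it on each simulated pair.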

It should, in general, be borne in mind that the 
population of parametric values, having the fiducial 
distribution inferred from any particular sample, 
does not, of course, concern any population or 
populations from which that sampled might have 
been in reality chosen at random, for the evidence 
available concerns one population only, and tells us 
nothing of any parent population that might lie 
behind it. Being concerned with probability, not with 
history, the fiducial argument, when available, shows 
that the information provided by the sample about 
this one population is logically equivalent to the 
information, which we might alternatively have
possessed, that it had been chosen at random from an 
aggregate specified by the fiducial probability dis- 
tribution. 


5. The fiducial distribution of functions of the parameters 


A problem of this kind of some practical importance 
arises when it is desired to locate that value on a 
continuous scale which divides a hypothetical normal 
distribution in a given ratio. For example, to find 
the value which is only exceeded by one in forty of 
the population, of which we must judge by means of 
a randomly chosen sample of N measured individuals. 

If $\mu$ and $\sigma$ are the mean and standard deviation
of the distribution, the point to be located may be
represented by

$$\mu + \alpha\sigma \qquad (86)$$

where, for the chosen frequency of one in forty, the
value of $\alpha$ must be about 1.96, and is always known
in terms of the frequency specified.

If $\bar{x}$ and s stand for the mean and standard
deviation as estimated from the sample of observations
we may put

$$\bar{x} + as = \mu + \alpha\sigma, \qquad (87)$$

and calculate the distribution of the quantity a in
random samples, for any sample must yield such a
value. The distribution of a, which will depend on $\alpha$,
but not on the unknown parameters $\mu$ and $\sigma$,
effectively exhibits in known terms the fiducial
distribution of the particular linear function of the
two parameters, chosen for examination. Equally,
if some value a were chosen with the intention of
calculating from a succession of random samples,
drawn perhaps from different populations, the value

$$\bar{x} + as, \qquad (88)$$

then a knowledge of the sampling distribution of $\alpha$
for given a would display the distribution of the
unknown deviate $\alpha$, and therefore of the frequency
ratio in which the true distribution had been
partitioned.


The distribution of a for given $\alpha$ was first calculated
for the Introduction to the Mathematical Tables
(vol. 1) of the British Association (1931) as an
illustration of the appearance in statistical work of
the function

$$I_n(x) = \frac{1}{\sqrt{2\pi}}\int_0^\infty \frac{t^n}{n!}\,e^{-\frac12(t+x)^2}\,dt. \qquad (89)$$

Using a sample of N observations as basis, it takes
the form

$$\frac{(N-1)!}{2^{\frac12(N-3)}\left(\tfrac12(N-3)\right)!}\cdot\frac{(N-1)^{\frac12(N-1)}\sqrt{N}}{(N-1+Na^2)^{\frac12 N}}\; e^{-\frac12 N\alpha^2 (N-1)/(N-1+Na^2)}\; I_{N-1}\!\left(\frac{-Na\alpha}{\sqrt{N-1+Na^2}}\right) da, \qquad (90)$$

while the distribution of $\alpha$ for given a is easily
expressed by putting

$$\alpha = \frac{u}{\sqrt{N}} + \frac{a\chi}{\sqrt{N-1}}, \qquad (91)$$

where u is normally distributed with unit variance,
and $\chi$ is distributed independently of u in the familiar
distribution

$$\frac{1}{2^{\frac12(N-3)}\left(\tfrac12(N-3)\right)!}\;\chi^{N-2}\,e^{-\frac12\chi^2}\,d\chi. \qquad (92)$$

The variable $\alpha$ has then the distribution of the sum
of two variates distributed respectively in the normal
distribution and in that of $\chi$ for (N−1) degrees of
freedom.
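
The representation of the deviate as the sum of a normal variate and a multiple of $\chi$ may be compared with direct sampling. The sketch below makes several assumptions for illustration: a = 1.96, samples of eight, an arbitrary population (mean 10, standard deviation 2), and s taken as the usual estimate with divisor N−1. The deviate is computed once from simulated samples, as the point at which the sample mean plus a times s falls on the standard scale of the population, and once from independent normal and $\chi$ variates; both averages are compared with the value obtained from the known mean of $\chi$.

```python
import random
from math import sqrt, gamma

def alpha_from_sample(a, N, mu, sigma):
    """The deviate at which xbar + a*s falls in the population sampled."""
    xs = [random.gauss(mu, sigma) for _ in range(N)]
    xbar = sum(xs) / N
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (N - 1))
    return (xbar + a * s - mu) / sigma

def alpha_from_pivots(a, N):
    """The same deviate built from a normal variate u and an independent chi."""
    u = random.gauss(0.0, 1.0)
    chi = sqrt(random.gammavariate((N - 1) / 2, 2.0))   # chi, N-1 d.f.
    return u / sqrt(N) + a * chi / sqrt(N - 1)

random.seed(4)
a, N, reps = 1.96, 8, 40000           # illustrative values only
direct = [alpha_from_sample(a, N, mu=10.0, sigma=2.0) for _ in range(reps)]
pivots = [alpha_from_pivots(a, N) for _ in range(reps)]

mean_chi = sqrt(2) * gamma(N / 2) / gamma((N - 1) / 2)  # mean of chi, N-1 d.f.
expected_mean = a * mean_chi / sqrt(N - 1)
mean_direct = sum(direct) / reps
mean_pivots = sum(pivots) / reps
print(mean_direct, mean_pivots, expected_mean)
```

The population mean and standard deviation enter only in generating the samples, in keeping with the independence of the distribution from the parameters.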

On a point of pure logic it may be noticed that the
sampling distribution of a for given $\alpha$, like that of $\alpha$
for given a, is entirely independent of the parameters
of the distribution. In the fiducial distribution
found for the parametric function

$$\mu + \alpha\sigma = \bar{x} + as$$

it is the introduction of the two sufficient statistics
$\bar{x}$ and s, which brings in the requirement, that there
must be no data a priori about the parameters, on
which the fiducial argument relies. It requires that
the observed statistics can be taken as random, and
are unselected, and therefore representative values
for the population from which they were obtained.


6. Observations of two kinds 


It has been shown that observations of different 
kinds may justify conclusions involving uncertainty 
at different levels. It will be of some interest to 
consider the logical situation when observations of 
two such kinds are both available. 

For example, let us suppose it to be possible to set 
a recorder to determine, for an exactly adjusted time 
interval, whether or no a charged particle has been 
received in that interval. It will not be supposed
that the instrument will count them, if it receives 
more than one, but that it is capable of distinguishing 
the possible case of none, from the possible group of 
cases of “one or more." 

Let p stand for the unknown probability of there
being no particle, q the probability of there being one
or more, then, if t is the time interval for which the
instrument is set, and $\theta$ the unknown rate of delivery
per unit of time, it appears that

$$p = e^{-\theta t}. \qquad (93)$$


If then out of n trials it is observed that on a
occasions no particle is recorded, while on b occasions
there was one or more, we have a logical situation,
without knowledge a priori, having a Likelihood
Function,

$$p^a q^b, \qquad (94)$$

in which expressions in $\theta$ can be substituted for p
and q, to give the Mathematical Likelihood of any
value of $\theta$; but there is in the data no basis for making
probability statements determining the probability
that $\theta$ should lie between assigned limits.
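
The likelihood of any value of the rate can be tabulated directly. In the sketch below the record (12 blank intervals, 8 with one or more particles, interval 0.5 units) and the function names are assumptions made for illustration; the likelihood is scaled so that its greatest value, attained where p equals the observed proportion a/(a+b), is unity.

```python
from math import exp, log

def relative_likelihood(theta, a, b, t):
    """Likelihood p^a q^b with p = exp(-theta * t), scaled to unity at its
    maximum, where p = a/(a+b)."""
    n = a + b
    p = exp(-theta * t)
    q = 1 - p
    return (p / (a / n)) ** a * (q / (b / n)) ** b

def most_likely_rate(a, b, t):
    """Value of theta at which p = exp(-theta * t) equals a/(a+b)."""
    return -log(a / (a + b)) / t

a, b, t = 12, 8, 0.5          # illustrative record; both counts must be positive
theta_hat = most_likely_rate(a, b, t)
for theta in (0.25, 0.5, theta_hat, 1.5, 2.0, 3.0):
    print(round(theta, 4), round(relative_likelihood(theta, a, b, t), 4))
```

The table so formed ranks the possible rates by likelihood; it does not, by itself, assign a probability to any range of them.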

Suppose, now, using the same supply of charged
particles, it is possible to measure accurately a single
randomly chosen time-interval between successive
emissions. If this measured value is $x_1$, then for any
given $\theta$ the distribution of x is

$$e^{-\theta x}\,\theta\,dx; \qquad (95)$$

the probability of the random variable x exceeding
the observed value, $x_1$, is

$$P = e^{-\theta x_1}, \qquad (96)$$

and the fiducial distribution of $\theta$ is

$$dP = x_1 e^{-\theta x_1}\,d\theta. \qquad (97)$$

This fiducial distribution supplies information of
exactly the same sort as would a distribution given
a priori. In fact, writing $\lambda = x_1/t$,

$$P = e^{-\theta x_1} = p^{\lambda} \qquad (98)$$

supplies the element of frequency

$$dP = \lambda\,p^{\lambda-1}\,dp, \qquad (99)$$

needed to complete Bayes' method. The simultaneous
probability of p lying in this range, and of
giving rise to the frequencies observed, is then

$$\frac{(a+b)!}{a!\,b!}\,\lambda\,p^{a+\lambda-1} q^b\,dp; \qquad (100)$$

the integral for all values of p over the range from
0 to 1 is

$$\frac{(a+b)!}{a!\,b!}\cdot\frac{\lambda\,(a-1+\lambda)!\,b!}{(a+b+\lambda)!}, \qquad (101)$$ *

and the probability a posteriori is

$$\frac{(a+b+\lambda)!}{(a-1+\lambda)!\,b!}\,p^{a-1+\lambda} q^b\,dp. \qquad (102)$$


* The factorial function, x!, has been generalized from positive
integers only to all real numbers exceeding −1, by the Eulerian
integral

$$x! = \int_0^\infty t^x e^{-t}\,dt.$$

If the observation $x_1$ were made first and the recorder
set so that

$$t = x_1, \qquad (103)$$

then $\lambda = x_1/t$ would be unity, and we should have exactly
Bayes' solution; but if it were thought better to set,
for example,

$$t = 2x_1, \qquad (104)$$

then $\lambda$ would be ½, and the probability distribution
a posteriori would be

$$\frac{(a+b+\tfrac12)!}{(a-\tfrac12)!\,b!}\,p^{a-\frac12} q^b\,dp. \qquad (105)$$

Such a distribution a posteriori, whether expressed
in terms of p or of $\theta$, would be logically equivalent to
a Bayesian probability a posteriori, or, equally, to one
based exclusively on a fiducial argument, for the
parameter is in each case a random variable of known
distribution.
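
The normalizing factor of the distribution a posteriori can be checked numerically for fractional as well as integral values of the exponent. The sketch below is an illustration under assumptions: the counts a = 12, b = 8 are invented, the symbol `lam` stands for the ratio of the measured interval to the recorder setting, and the generalized factorial x! is computed as `math.gamma(x + 1)`. A simple midpoint rule is used for the integral over p from 0 to 1.

```python
from math import gamma

def posterior_density(p, a, b, lam):
    """Probability a posteriori per unit of p:
    (a+b+lam)! / ((a-1+lam)! b!) * p^(a-1+lam) * q^b, with x! = gamma(x+1)."""
    norm = gamma(a + b + lam + 1) / (gamma(a + lam) * gamma(b + 1))
    return norm * p ** (a - 1 + lam) * (1 - p) ** b

def integrate(f, lo, hi, n=20000):
    """Midpoint rule with n equal strips."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

a, b = 12, 8                  # illustrative record
totals = {}
for lam in (1.0, 0.5):        # recorder set at x1, and at twice x1
    totals[lam] = integrate(lambda p: posterior_density(p, a, b, lam), 0.0, 1.0)
    print(lam, totals[lam])
```

With `lam` equal to unity the density is that of Bayes' own solution, $(a+b+1)!/(a!\,b!)\;p^a q^b$.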


7. Inferences from likelihoods 


The mode of inference which takes the form of 
probability statements about parameters can lead 
to the alternative mode in the form of probability 
statements about future verifiable observations, by a 
general form of calculation; namely, if 


$$F(\theta)\,d\theta$$

is the probability that the parameter lies in the range
$d\theta$, and if for values within this range the probability
of any future observable contingency, A, is

$$P_A(\theta)$$

then, eliminating $\theta$, the probability of A is

$$P_A = \int P_A(\theta)\,F(\theta)\,d\theta, \qquad (106)$$

taken over all possible values of the parameter.
The converse process of inferring the frequency
distribution of $\theta$ from a knowledge of quantities
such as $P_A$, for a sufficient variety of future possibilities
A, is usually possible, and is extremely simple
when A is taken to represent experience so ample as
to confine the uncertainty of the parameter consistent
with it, to an arbitrarily small range.
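
The elimination of the parameter may be illustrated with the charged-particle example of the previous Section. The sketch below is an assumption-laden illustration: the frequency element of the rate is taken as $x_1 e^{-\theta x_1}\,d\theta$, the contingency A is "no particle received in a future interval t", with probability $e^{-\theta t}$ for given rate, and the figures $x_1 = 0.8$, $t = 0.5$ are invented. The integral is evaluated by the midpoint rule over a range long enough for the remainder to be negligible, and compared with the closed form $x_1/(x_1+t)$.

```python
from math import exp

def eliminate(prob_given_theta, freq_element, upper, n=200000):
    """Midpoint-rule value of the integral of P_A(theta) F(theta) d(theta)
    over 0 < theta < upper."""
    h = upper / n
    return h * sum(prob_given_theta((i + 0.5) * h) * freq_element((i + 0.5) * h)
                   for i in range(n))

x1, t = 0.8, 0.5              # illustrative: one measured interval, one setting
P_A = eliminate(lambda th: exp(-th * t),          # P_A(theta)
                lambda th: x1 * exp(-th * x1),    # F(theta)
                upper=60.0)
print(P_A, x1 / (x1 + t))
```

The result is a statement about a future observation only, and could be verified by repeated trials of the kind described.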

The cases in which the observations, together with 
other data, allow only of a statement of the Mathe- 
matical Likelihood, require separate consideration, 
to obtain a clear view of the rational prospect which 
future contingencies present, when knowledge of the 
Likelihood Function only is available. 

In the previous Section it has been shown that, 
where data of two kinds are simultaneously available, 
one capable of supplying Likelihood statements only, 
while the other, however meagre and uninformative 
it may be quantitatively, is capable of leading to 
Probability statements, then the two kinds of data, 
available, as it were, in parallel, will supply inferences 
in terms of Mathematical Probability. And this is 
done, as Bayes had shown, by multiplying each 
element of the frequency distribution by a multiple 
of the corresponding Likelihood, this multiple being 
chosen to make the elements so formed add, or 
integrate, to unity. 

Further, from the nature of the Likelihood 
Function it is evident that if data yielding Likeli- 
hood statements only are available from two 
independent sources, the aggregate of the two sources 
of data will supply simply a Likelihood Function 
found by multiplying together the two functions 
supplied by its parts. 

In view of these various relationships it would be 
natural to expect that when the two types of nexus 
represented by Mathematical Likelihood and Mathematical
Probability are connected, as it were, in
series, so that we have the likelihood of an exhaustive
set of possibilities, for each of which the probability
of some event, A, is known, the whole would yield
statements, not of probability, but of likelihood only. 
This humbler status is not, however, incompatible 
with substantial utility. 

Let us consider, from this standpoint, data of the 
Bayesian type, without knowledge a priori, in which 
a successes have been observed out of (a+b) trials, in 
relation to the rational bearing of such an observation 
on the prospect of observing c successes out of a 
subsequent (c+d) trials. For example, if 3 successes 
have been observed out of 19 trials, what is the
prospect of observing 14 successes in a subsequent set
of 21 trials? 


Considering the two-by-two table 


     3   16 |  19
    14    7 |  21
    17   23 |  40


we may recall that the relative likelihood of the first
observation of three successes to sixteen failures is

$$\frac{19^{19}}{3^3\cdot 16^{16}}\,p^3 q^{16}, \qquad (107)$$

which can be raised to unity by an appropriate
choice of p. Similarly, with the second
sample, if the probabilities of success and failure are
p′ and q′, the relative likelihood is

$$\frac{21^{21}}{14^{14}\cdot 7^7}\,p'^{14} q'^{7}; \qquad (108)$$
now, if p and p′ are to be the same, so that the two
samples are drawn fairly from the same population,
the most likely common value to give them is the
total ratio of success, or 17/40. Inserting this, and
the complementary value, for p, q and for p′, q′, we
find the likelihood of the whole table, namely the
expression

$$\frac{17^{17}\cdot 23^{23}\cdot 19^{19}\cdot 21^{21}}{40^{40}\cdot 3^3\cdot 7^7\cdot 14^{14}\cdot 16^{16}}, \qquad (109)$$

in which the function

$$x^x \qquad (110)$$

has replaced the factorial function (x!) used in the
expression for the probability of the numerical table
discussed, among others having the same margins.
Taking logarithms to the base 10, the numerical
values are


     x     log x^x            x     log x^x

    17     20.9176317         3      1.4313638
    19     24.2963184         7      5.9156863
    21     27.7666052        14     16.0457925
    23     31.3197402        16     19.2659197
                             40     64.0823997
          104.3002955              106.7411620
          106.7411620

          $\bar{3}.5591335$


The relative likelihood of the two-by-two table
under discussion is only about 0.003623, and this is
the likelihood assignable to the prospective contingency
of fourteen successes out of twenty-one, in
view of the data available. With the aid of a Table
of $x^{x}$, such as is given on page 137, it is easy to
calculate such a series as the following:


    Number of successes
    out of 21              Likelihood

        14                  0.0036
        13                  0.0093
        12                  0.0216
        11                  0.0460
        10                  0.0904
         9                  0.1642
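
The series above can be checked numerically. The following Python sketch evaluates expression (109) through logarithms of x^x, much as the text does with its table. The function names `xlogx` and `table_likelihood` are illustrative choices, not the author's, and the convention 0^0 = 1 is an assumption made so that empty cells can be handled.

```python
import math

def xlogx(x):
    """log10 of x**x, taking 0**0 as 1 (an assumed convention)."""
    return x * math.log10(x) if x > 0 else 0.0

def table_likelihood(a, b, c, d):
    """Relative likelihood of the fourfold table [[a, b], [c, d]] as in (109):
    the four margins enter the numerator, the four cells and the grand
    total enter the denominator, each through the function x**x."""
    num = xlogx(a + c) + xlogx(b + d) + xlogx(a + b) + xlogx(c + d)
    den = xlogx(a + b + c + d) + xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
    return 10 ** (num - den)

# First sample: 3 successes, 16 failures. Prospective sample: c out of 21.
series = {c: table_likelihood(3, 16, c, 21 - c) for c in range(14, 8, -1)}
for c, lik in series.items():
    print(c, round(lik, 4))
```

Working in logarithms avoids forming numbers as large as 40^40 directly; the printed values should agree with the series in the text to the four places given there.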


Above 10 the likelihood is less than 1/15, and such
future contingencies may be recognized in advance
as definitely unlikely; above 8 the likelihood is still
less than 1/5. The two values 9 and 10 thus lie in a
zone in which the likelihood is still low. The test of
significance discussed in Chapter IV, Section 4,
could be used to confirm these judgements from
another standpoint. The test of significance suffers,
like the “confidence limits” calculated for a binomial
distribution, from some insensitivity due to the
discontinuity of the distribution, and this may be
thought to outweigh the rather formal advantage of
asserting significance at a given level, when this
yields no probability statement more definite than
an inequality.


The likelihood assigned to a fourfold table is
symmetrically related to the first (real) and second
(conjectural) sample. The likelihood of observing
14 successes out of 21 as judged by data showing 3
successes out of 19, is exactly the same as the
likelihood in prospect of observing 3 successes out of 19,
judged on the basis of data showing 14 successes out of 21.


As in the cases in which probability rather than
likelihood can be predicated, we may recover the
likelihood statement appropriate to the parameters,
by considering the likelihood in the limit of a very
large sample showing in all pN successes out of N.
The likelihood could then be written

$$\frac{(pN)^{pN}(qN)^{qN}\cdot 19^{19}(N-19)^{N-19}}{N^{N}\,3^{3}\,16^{16}\,(pN-3)^{pN-3}(qN-16)^{qN-16}}, \qquad (111)$$

leading, when N is increased without limit, to

$$\frac{19^{19}}{3^{3}\,16^{16}}\,p^{3}q^{16}, \qquad (112)$$

the likelihood as inferred directly for the parameters.
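
The passage to the limit can be illustrated numerically. The sketch below compares the logarithm of expression (111) with that of (112) for increasing N; the value p = 0.3 and the function names are assumptions made for illustration, and pN is treated as a continuous quantity.

```python
import math

def xlnx(x):
    """Natural log of x**x, taking 0**0 as 1."""
    return x * math.log(x) if x > 0 else 0.0

def log_lik_finite(N, p, a=3, b=16):
    """Log of expression (111): rows (a, b) and (pN - a, qN - b)."""
    q = 1.0 - p
    n = a + b
    num = xlnx(p * N) + xlnx(q * N) + xlnx(n) + xlnx(N - n)
    den = xlnx(N) + xlnx(a) + xlnx(b) + xlnx(p * N - a) + xlnx(q * N - b)
    return num - den

def log_lik_limit(p, a=3, b=16):
    """Log of expression (112), the likelihood of p from the sample alone."""
    n = a + b
    return xlnx(n) - xlnx(a) - xlnx(b) + a * math.log(p) + b * math.log(1.0 - p)

# The gap between the two should shrink roughly in proportion to 1/N.
gaps = [abs(log_lik_finite(N, 0.3) - log_lik_limit(0.3))
        for N in (10**3, 10**4, 10**5)]
print(gaps)
```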
It may be noted that the likelihood of a future
trial yielding c successes and d failures does not
involve the factor

$$\frac{(c+d)!}{c!\,d!} \qquad (113)$$

representing the number of ways in which such an
outcome could occur. Nor, if a subsequent trial
were to be made with chosen end-points, as in Fig. 3,
would the likelihoods of these end-points involve the
coefficients representing the number of paths in the
extended triangle leading to each. Unlike a probability,
the likelihood is independent of the number
of ways in which the result could be brought about,
for a statement of Likelihood does not involve a
measurable reference set.


8. Variety of logical types 

It is a noteworthy peculiarity of inductive inference
that comparatively slight differences in the mathematical
specification of a problem may have logically
important effects on the inferences possible. In
complicated cases, such effects may be very puzzling,
as the conditions of solubility of problems of the Nile
type have proved themselves to be. It may therefore
be useful to consider some cases of extreme simplicity.

Let us suppose that x and y are two observed
quantities, known each to be normally, and independently
distributed with unit variance, x about an
unknown value ξ, and y about η. It will be required
to draw inferences about the pair of values (ξ, η)
which may, of course, be represented as an unknown
point, H, on a plane, on which (x, y) may be represented
by an observed point, O.

If the data were as above, without further restric- 
tion, the probability distribution of the unknown 
point, formally demonstrable by the fiducial argu- 
ment, is evidently a normal bivariate distribution, 
with unit variance in all directions, centred at the 
observed point (x, y). Additional data do, however, 
alter the character of the problem. It may be 
interesting to compare three cases: 

(a) H is known to lie on a given straight line. 

(b) H is known to lie on a circle. 

(c) The given functional relationship between ξ
and η does not confine this point either to a straight
line, or to a circle, but to some other plane curve.

In all cases the likelihood of
any pair of parametric values is

$$e^{-\frac{1}{2}r^{2}}$$

where r is the distance from O
to H. The nearest point on the
curve to the observed point O
then represents the solution of
maximum likelihood.

[Fig. 4]

If M stands for this point, the relative likelihood of
any other point is

$$\exp\{-\tfrac{1}{2}(OH^{2}-OM^{2})\}.$$

Setting this equal to any chosen series of con- 
ventional fractions we shall have defined zones 
around the point of maximal likelihood M, com- 
prising all points satisfying the functional relation- 
ship, however these may be connected. 

(a) In the particular case in which H is restricted
to a given straight line, M satisfies the condition for a
Sufficient estimate, for the relative likelihood of any
point H on the line is simply

$$\exp(-\tfrac{1}{2}HM^{2}) \qquad (114)$$

and this is the same for all possible observation
points on the line OM, produced if necessary. That
is, for all observations leading to the same estimate.

The probability that HM should exceed any
quantity u, positive or negative, is therefore

$$\frac{1}{\sqrt{2\pi}}\int_{u}^{\infty} e^{-\frac{1}{2}t^{2}}\,dt; \qquad (115)$$

otherwise stated, the fiducial distribution of H is
Normal with unit variance, and centred at the
estimate point M.

In view of the relation (115) it appears that if the
data were modified so that instead of an unlimited
straight line, the possible range for H had a terminus
T, then necessarily the fiducial distribution extends
only so far as T, and at T it has a probability
condensation equal to

$$\frac{1}{\sqrt{2\pi}}\int_{TM}^{\infty} e^{-\frac{1}{2}t^{2}}\,dt, \qquad (116)$$

where TM is positive if M falls within the permitted
range, but is taken to be negative if M lies outside it.
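
The tail integrals (115) and (116) are ordinary Normal tail areas, and may be evaluated with the complementary error function. The sketch below uses Python's standard `math.erfc`; the function names are hypothetical, chosen only to mirror the text.

```python
import math

def upper_tail(u):
    """P(t > u) for a standard Normal variate t, i.e. expression (115)."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def condensation_at_terminus(tm):
    """Probability condensed at the terminus T, expression (116).
    tm is the signed distance TM: positive when the estimate point M
    lies inside the permitted range, negative when it lies outside."""
    return upper_tail(tm)

# When M lies well inside the range little probability is condensed at T;
# when M lies outside the range more than half of it is.
print(condensation_at_terminus(2.0), condensation_at_terminus(-1.0))
```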

Knowing the frequency distribution of H, having
coordinates (ξ, η), it is possible to calculate that of a
second pair of random values (x', y'), the coordinates
of a point O'. In the case of an unlimited straight
line, it is easy to see that O' is distributed normally
about M as centre, with variance unity in directions
at right angles to the line, and twice as much in
directions parallel with it. The contours of equal
frequency density are ellipses with eccentricity 1/√2,
or 0.707. The probability, on the data, of such a
second trial lying within any defined area is then
calculable.


(b) [Fig. 5]

When H is restricted to the circumference of a circle,
the distance OC between the centre of the circle and the
point observed serves as an ancillary statistic,
the sampling distribution of which is
the same for all points H on the circle. Consequently,
we have only to determine the frequency distribution
of the angle HCO, for given values of the distance OC,
in order to obtain a fiducially determined distribution
of the unknown point H.

If R stand for the radius of the circle, θ for the
angle HCO, and a for the distance OC, then, since

$$HO^{2} = R^{2}+a^{2}-2aR\cos\theta, \qquad (117)$$

the factor of

$$e^{-\frac{1}{2}HO^{2}} \qquad (118)$$

which depends on θ is simply

$$e^{aR\cos\theta}. \qquad (119)$$

But

$$\int_{0}^{2\pi} e^{aR\cos\theta}\,d\theta = 2\pi\left(1+\frac{a^{2}R^{2}}{2^{2}}+\frac{a^{4}R^{4}}{2^{2}\cdot 4^{2}}+\dots\right) = 2\pi I_{0}(aR) \qquad (120)$$

expressed as a Bessel function. Hence, the fiducial
distribution of θ is

$$\frac{1}{2\pi I_{0}(aR)}\,e^{aR\cos\theta}\,d\theta, \qquad (121)$$

the frequency density decreasing exponentially in the
direction parallel to OC, at a rate, however, which
depends on the ancillary statistic a. Here again we
have a well-determined frequency distribution for
the unknown point H (ξ, η), from which fiducial limits
at all levels of probability can be calculated.
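
As a numerical check on the normalizing constant in (120) and (121), the following sketch sums the power series for the Bessel function and integrates the resulting density over a full turn. The values a = 1.5 and R = 2.0, the number of series terms, and the midpoint rule are all assumptions made for illustration.

```python
import math

def bessel_i0(z, terms=60):
    """Sum of (z/2)**(2k) / (k!)**2, the series appearing in (120)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (z / 2.0) ** 2 / (k + 1) ** 2
    return total

def fiducial_density(theta, a, R):
    """Density of the angle HCO given by (121)."""
    return math.exp(a * R * math.cos(theta)) / (2.0 * math.pi * bessel_i0(a * R))

def integrate(f, lo, hi, n=20000):
    """Midpoint rule, adequate for a smooth periodic integrand."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

total = integrate(lambda t: fiducial_density(t, 1.5, 2.0), 0.0, 2.0 * math.pi)
print(total)  # should be very close to unity
```

Fiducial limits for θ at any level could be obtained from the same density by integrating outwards from θ = 0 until the required probability is enclosed.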

(c) In general, however, it is not to be expected
either that a Sufficient statistic should exist, or that
the most likely estimate could be made exhaustive by
means of ancillary values. In such cases
inference is effectively completed by the calculation
of the Mathematical Likelihood for each possible
position of the unknown point. The practice of
ignoring this quantity as a measure of rational belief
appropriate to such cases would seem to leave the
statistician who chooses this course without any
resource in respect of a great many subjects in which
rational inference is possible. The fact that stronger
inferences may be desired, and are certainly possible
in other cases, seems to be no reason for not attuning
our minds towards understanding the information
actually available.

It is particularly to be noted in this example that
the differences in the logical form of the available
inferences flow from [...]
     x      x log10 x

     2      0.602 0600
     3      1.431 3638
     4      2.408 2400
     5      3.494 8500
     6      4.668 9075
     7      5.915 6863
     8      7.224 7199
     9      8.588 1826
    10     10.000 0000

    28     40.520 4249
    30     44.313 6376
    31     46.232 2125
    32     48.164 7993
    34     52.070 2832
    35     54.042 3816
    38     60.031 7767
    39     62.051 5197
    40     64.082 3997

    64    115.595 5183
    65    117.839 3682
    66    120.089 8997
    67    122.347 0118
    68    124.610 6061
    69    126.880 5873
    70    129.156 8628

    71    131.439 3428
    72    133.727 9397
    73    136.022 5688
    74    138.323 1473
    75    140.629 5948
    76    142.941 8330
    77    145.259 7858
    78    147.583 3790
    79    149.912 5402
    80    152.247 1990

    81    154.587 2865
    82    156.932 7359
    83    159.283 4817
    84    161.639 4600
    85    164.000 6087
CHAPTER VI
THE PRINCIPLES OF ESTIMATION


1. Relations to other work

The logical principles of statistical reasoning, which 
it is my purpose in this book to set out for explicit 
consideration, have underlain and been implicitly 
required in the development of the two other main 
aspects of Statistical Science, namely (a) the mathe- 
matical methodology of the handling of bodies of 
observational data, so as to elicit what they have to 
tell us, and (b) the Design of the logical structure of 
an observational record, whether of an experiment, 
or of a survey, so as to ensure its completeness and 
cogency as a tool of research. In the two books that 
I have written with these ends in view it has not 
in either case seemed appropriate to enlarge upon 
purely logical considerations which had in fact found 
their fullest expression in earlier work on the Theory 
of Estimation. This theory is adverted to, therefore, 
in these books only with their particular ends in 
view. In Statistical Methods to exhibit the existence 
of competent and practical methods applicable to 
data of many types, to exemplify some of the kinds 
of complication which ordinarily arise, and to bring 
a wider class of cases into logical connection with the 
Analysis of Variance. In the Design of Experiments, 
I had chiefly in view, in this part of the book, the 
use of the concept of Amount of Information as a 
measurable characteristic by which the precision of 
an experiment could be anticipated, or confirmed, 
and compared with the expenditure of effort entailed.
In both books I hoped that the examples chosen
would not only indicate methods useful in themselves,
but would also facilitate the development of
principles of reasoning by which a body of data may
be interpreted. And I believe they have indeed had
this effect. Nevertheless, in both cases, it was my
object to set out only what was immediately applicable
to a particular end, and to avoid the multiplicity
of abstract concepts which an adequate discussion of
the subject from a logical standpoint necessarily
requires. I shall hope in this Chapter, on the contrary,
to direct attention primarily to the logical
aspects, and to develop these with a minimum of
mathematical and technical complexity.

The Theory of Estimation discusses the principles
upon which observational data may be used to
estimate, or to throw light upon the values of theoretical
quantities, not known numerically, which
enter into our specification of the causal system
operating. These principles have been more or less
familiar for many years, but have been confused by a
number of false starts due to insufficient appreciation
of the nature of the problem.

A primary, and really very obvious, consideration
is that if an unknown parameter θ is being estimated,
any one-valued function of θ is necessarily being
estimated by the same operation. The criteria used
in the theory must therefore be invariant under such
functional transformations. This excludes the criterion
of unbiasedness, meaning that the average value of the
estimate should be equal to the true estimand; for if this were
true of any parameter, it could not also be true of,
for example, its square. Another criterion in which
the need for invariance in respect to functional trans- 
formations has been overlooked is that the Confidence 
Interval, or range within which the parameter is 
not rejected on some test of significance, shall be as 
short as possible. This is an inappropriate require- 
ment since the relative lengths of any overlapping 
intervals may be adjusted arbitrarily by a functional 
transformation. 

A distinction without a difference has been introduced
by certain writers who distinguish “Point
estimation”, meaning some process of arriving at an
estimate without regard to its precision, from 
“Interval estimation” in which the precision of the 
estimate is to some extent taken into account. 
“Point estimation” in this sense has never been 
practised either by myself, or by my predecessor 
Karl Pearson, who did consider the problem of 
estimation in some of its aspects, or by his predecessor 
Gauss of nearly one hundred years earlier, who laid 
the foundations of the subject. The distinction seems 
only to be made in order to support a claim, which is 
not indeed historical, to the effect that the authors 
have made in this matter an original contribution. 
It shows great confidence in the ignorance of students 
to put such a claim forward. 

The following is not a complete exposition of the 
Theory of Estimation, but an outline emphasizing 
the origin and relevance of the logical concepts used 
elsewhere in this book. 


2. Criteria of estimation 
The fundamental criterion of estimation is known
as the Criterion of Consistency, and is essentially a
means of stipulating that the process of estimation is
directed to the particular parameter under discussion,
and not to some other function of the
parameter or parameters. Of the attempts
made to express this idea, one at least has been
unsatisfactory, and all perhaps deserve restatement.

If a number, finite or infinite, of observable classes
have probabilities of occurrence

$$p_i, \qquad S(p_i)=1,$$

which are known functions of the parameters; and
if out of N observations, the numbers observed to fall
in these are

$$a_i, \qquad S(a_i)=N,$$

then a linear function of the observed frequencies

$$A = S(c_i a_i) \qquad (122)$$

will take, when for each a is substituted its mean

$$\bar{a}_i = N p_i, \qquad (123)$$

the value

$$A = N\,S(c_i p_i), \qquad (124)$$

which is a known function of the probabilities, p, and
therefore of the parameters.

The statistic [...]

Now, if, for example, the value of

$$S(c_i p_i),$$

summed over all possible observational classes, were
the parametric function

$$\log\theta \qquad (126)$$

we might make the estimate

$$T = e^{A/N}, \qquad (127)$$

and notice that when for all observed frequencies
their expected values are substituted, the estimate T
is such that it becomes identical with the estimand θ.
This property is evidently invariant for transformations
of the parameters, and does not imply the
statement that T is an unbiased estimate of θ, for, in
this case, it happens that log T is an unbiased estimate of
log θ.
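
The distinction between consistency in this sense and unbiasedness can be made concrete. The sketch below assumes a hypothetical two-class model, not taken from the text, in which the first class has probability log θ (so that 1 < θ < e) and the coefficients are c = (1, 0); then S(c p) = log θ and the estimate is T = exp(a₁/N). Expectations are computed exactly by summing over the binomial distribution of a₁.

```python
import math

def binom_pmf(k, n, p):
    """Probability of k occurrences in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def expectations(theta, N):
    """Exact averages of T = exp(a1/N) and of log T in the assumed model,
    where the first class has probability log(theta)."""
    p1 = math.log(theta)
    e_T = sum(binom_pmf(k, N, p1) * math.exp(k / N) for k in range(N + 1))
    e_logT = sum(binom_pmf(k, N, p1) * (k / N) for k in range(N + 1))
    return e_T, e_logT

theta, N = 2.0, 10
e_T, e_logT = expectations(theta, N)

# Substituting the expected frequency N*log(theta) for a1 returns theta itself.
T_at_expectation = math.exp(N * math.log(theta) / N)
print(T_at_expectation, e_T, e_logT)
```

In this assumed model log T is unbiased for log θ while the average of T exceeds θ, which is the situation described in the text.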

In respect to bias it should be noted that no 
difficulty is usually experienced in adjusting an 
estimate so that the average of the adjusted value 
shall be equal to any particular parametric function; 
what has sometimes seemed to need emphasis is that 
the estimate is not necessarily improved by such an 
adjustment, which will introduce bias, not previously 
present, into the estimates of most functionally 
connected values. Before making such an adjust- 
ment, consideration should be given to its purpose. 

The relations set out above are exact, and do
not depend on the observed frequencies a being
sufficiently large. It may well be that N of the
frequencies are unity, and a very large, or infinite
number, are zero; as, indeed will be the case when
tolerably accurate measurements occur in the data,
for these will be interpreted as single observations
within very small ranges

$$x \pm \tfrac{1}{2}\,\delta x, \qquad (128)$$

and any function

$$\frac{1}{N}\,S\{c(x)\}, \qquad (129)$$

where c is a continuous function of x will be
recognized as a linear function of the frequencies, and
therefore as a suitable ingredient from which a
consistent statistic can be built. The use of functions
non-linear in the frequencies would in these cases
introduce discontinuities whenever two measurements
happen to coincide.

A Consistent Statistic may then be defined as:

A function of the observed frequencies which
takes the exact parametric value when for these
frequencies their expectations are substituted.

This definition is applicable with exactitude to finite
samples.
A much less satisfactory definition has often been
used, namely that the probability that the error of
estimation exceeds in absolute value any value ε, or,
symbolically,

$$\Pr\{|T-\theta|>\varepsilon\}, \qquad (130)$$

shall tend to zero as the size of sample is increased,
for all positive values ε, however small.

With respect to a function of the observations, T,
defined for all possible sizes of sample this definition
has a certain meaning. However, any particular
method of treating a finite sample of N₁ observations
may be represented as belonging to a great variety of
such general functions. In particular, if T' stand for
any function whatsoever of N₁ observations, and
T_N for any function fulfilling the asymptotic condition
of consistency, then

$$\frac{1}{N}\{N_1 T' + (N-N_1)\,T_{N-N_1}\} \qquad (131)$$

is itself a statistic defined for all values of N, and
tending asymptotically to the limit θ, yet it is
recognizable when N=N₁ as the arbitrary function
T' calculated from the finite sample.

In fact, the asymptotic definition is satisfied by any 
statistic whatsoever applied to a finite sample, and 
is useless for the development of a theory of small 
samples. 
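
The construction (131) can be exhibited directly. In the sketch below the arbitrary function T' is deliberately absurd (it ignores the data and returns 42), while the sample mean plays the part of the asymptotically consistent function; both choices, and the alternating test data, are assumptions made purely for illustration.

```python
def arbitrary_t(obs):
    """A deliberately absurd 'estimate' T' of the first N1 observations."""
    return 42.0

def mean(obs):
    return sum(obs) / len(obs)

def patched_statistic(obs, n1):
    """Expression (131): T' on the first n1 observations, the mean on the rest."""
    n = len(obs)
    if n < n1:
        raise ValueError("need at least n1 observations")
    head, tail = obs[:n1], obs[n1:]
    tail_part = (n - n1) * mean(tail) if tail else 0.0
    return (n1 * arbitrary_t(head) + tail_part) / n

theta, n1 = 5.0, 10
# Deterministic data alternating about theta, so the mean settles at theta.
data = [theta + (-1) ** i for i in range(100000)]

small = patched_statistic(data[:n1], n1)   # the arbitrary T' itself
large = patched_statistic(data, n1)        # close to theta
print(small, large)
```

The statistic satisfies the asymptotic definition, yet at N = N₁ it is nothing but the arbitrary function, which is the objection raised in the text.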


3. The concept of efficiency 


An asymptotic or large-sample definition is, however,
appropriate as a first step in defining the concept
of efficiency. Consider the statistic

$$T = \frac{1}{N}\,S(a_i c_i), \qquad (132)$$

in which the c_i are so far undetermined functions of
θ, the parametric function of which T is to be an
estimate. Then, T is consistent, when θ=θ₀, if

$$\theta_0 = S\{c_i\,p_i(\theta_0)\}, \qquad (133)$$

and with the same coefficients c it will remain nearly
consistent for small variations of θ if

$$S\left\{c_i\,\frac{\partial}{\partial\theta}p_i(\theta)\right\} = 1 \quad \text{at } \theta=\theta_0. \qquad (134)$$

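For large samples the variations to be expected
may be made indefinitely small. Now the sampling
variance of the linear function T is, exactly,

$$V(T) = \frac{1}{N}\{S(p_i c_i^{2}) - \theta_0^{2}\}, \qquad (135)$$

and this variance may be minimized for variations of
the coefficients c, subject to the limitations that the
estimate shall be locally Consistent, by minimizing

$$S(p_i c_i^{2}) - \lambda\,S\left(c_i\frac{\partial p_i}{\partial\theta}\right) - \mu\,S(p_i c_i). \qquad (136)$$

Varying any particular value c_i we find

$$2p_i c_i - \lambda\frac{\partial p_i}{\partial\theta} - \mu p_i = 0, \qquad (137)$$

from which μ may be obtained by direct addition,
giving

$$\mu = 2\theta_0, \qquad (138)$$

and λ by multiplying by c and adding for all classes,
giving a second equation,

$$2S(p_i c_i^{2}) = \lambda + \mu\theta_0, \qquad (139)$$

so that for each particular coefficient

$$p_i c_i = \theta_0 p_i + \frac{\partial p_i}{\partial\theta}\{S(p_i c_i^{2}) - \theta_0^{2}\}, \qquad (140)$$

or

$$p_i(c_i-\theta_0) = \frac{\partial p_i}{\partial\theta}\,S\{p_i(c_i-\theta_0)^{2}\}. \qquad (141)$$

Hence it easily follows that

$$N\,V(T) = S\{p_i(c_i-\theta_0)^{2}\} = \frac{1}{S\left\{\frac{1}{p_i}\left(\frac{\partial p_i}{\partial\theta}\right)^{2}\right\}}, \qquad (142)$$

which we may write briefly as 1/i. Moreover,

$$c_i - \theta_0 = \frac{1}{i\,p_i}\,\frac{\partial p_i}{\partial\theta}. \qquad (143)$$
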
For large samples, therefore, from populations
having any particular parametric value θ, the linear
function

$$T = \theta + \frac{1}{Ni}\,S\left(\frac{a_i}{p_i}\frac{\partial p_i}{\partial\theta}\right) \qquad (144)$$

is locally Consistent, and subject to this condition,
has the least possible variance for samples of a given
size. Since the p_i are known functions of θ, the
general equation

$$S\left(\frac{a_i}{p_i}\frac{\partial p_i}{\partial\theta}\right) = 0 \qquad (145)$$

or

$$S\left(\frac{a_i}{m_i}\frac{\partial m_i}{\partial\theta}\right) = 0, \qquad (146)$$

if m=pN, being linear in the frequencies, though of
any algebraic form in θ, will give, with large samples,
estimates of the highest precision for all values
of θ.

In these limiting conditions the distribution of the
estimate T tends to Normality, so that the specification
of the variance as

$$V(T) = \frac{1}{Ni}, \qquad (147)$$

where, as above, i stands for

$$i = S\left\{\frac{1}{p_i}\left(\frac{\partial p_i}{\partial\theta}\right)^{2}\right\} = -S\left\{p_i\,\frac{\partial^{2}}{\partial\theta^{2}}\log p_i\right\}, \qquad (148)$$

is sufficient to specify fully the sampling distribution.
The quantity Ni is the invariance of the estimate, and i
itself may be recognized as the amount of information
to be expected for each observation made. The
qualification “to be expected” is a reminder that the
quantity i is itself a function of θ, of central importance
also in the theory of small samples, and that
with small samples the estimate of θ obtained from
any sample will not be exactly correct.
The form of the Efficient equation of estimation,

$$S\left(\frac{a_i}{p_i}\frac{\partial p_i}{\partial\theta}\right) = 0, \qquad (149)$$

shows that it could be derived by maximizing

$$S(a_i \log p_i), \qquad (150)$$

or by maximizing that factor of the Mathematical
Likelihood of θ which depends on θ, namely

$$\prod\,(p_i^{a_i}). \qquad (151)$$

The solution of this equation is Consistent, and
invariant for transformations of the parameter.
Naturally, it is not generally unbiased.
The equations of Maximum Likelihood are indeed
the only equations of estimation, linear in the
observed frequencies, which are efficient with large
samples. The solutions of these equations are not
generally linear functions of the frequencies.
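
A small worked case may make the equation of estimation concrete. The sketch below assumes a hypothetical three-class model with probabilities θ², 2θ(1−θ), (1−θ)², which is not taken from the text, and hypothetical frequencies (10, 20, 30). It evaluates the information i of (148), solves the equation (149) by bisection, and reports the large-sample variance 1/(Ni) of (147).

```python
def probs(theta):
    """Class probabilities of the assumed three-class model."""
    return [theta**2, 2 * theta * (1 - theta), (1 - theta) ** 2]

def dprobs(theta):
    """Derivatives of the class probabilities with respect to theta."""
    return [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]

def information(theta):
    """i = S{(1/p)(dp/dtheta)^2}, the information per observation (148)."""
    return sum(d * d / p for p, d in zip(probs(theta), dprobs(theta)))

def score(a, theta):
    """Left-hand side of the equation of estimation (149)."""
    return sum(ai * d / p for ai, p, d in zip(a, probs(theta), dprobs(theta)))

def solve_score(a, lo=1e-9, hi=1 - 1e-9, iters=200):
    """Bisection, assuming the score falls from positive to negative."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(a, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = [10, 20, 30]                      # hypothetical observed frequencies
theta_hat = solve_score(a)
variance = 1.0 / (sum(a) * information(theta_hat))
print(theta_hat, variance)
```

For this particular model the information reduces algebraically to 2/{θ(1−θ)}, which gives a convenient check on the numerical value.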


4. Likelihood and Information 


The connections between the Likelihood Function
and the Amount of Information are worth noting.
The Likelihood Function is determined by a particular
sample, or corpus of observations, and shows for
such observations the relative frequency with which
different parametric values would yield such a
sample. When the logarithm of the likelihood is
used, different independent samples, of the same or of
different kinds, which throw light on the same parameter,
may be combined merely by adding the log
likelihoods for each value of the parameter.
For any particular value of the parameter, the
probability of obtaining a given observational record
may be represented by φ. Then using summation
over all possible samples, it is clear that

$$S(\phi)=1, \qquad S(\phi')=0, \qquad S(\phi'')=0, \qquad (152)$$

where differentiation with respect to a parameter is
indicated by a prime.

Now

$$\frac{\partial^{2}}{\partial\theta^{2}}(\log\phi) = \frac{\phi''}{\phi} - \left(\frac{\phi'}{\phi}\right)^{2}, \qquad (153)$$

so that multiplying by φ and adding for all possible
samples

$$S\left\{\phi\,\frac{\partial^{2}}{\partial\theta^{2}}(\log\phi)\right\} = S(\phi'') - S\left(\frac{\phi'^{2}}{\phi}\right) = 0 - I, \qquad (154)$$

since the last summation is merely the information
expected for samples from populations having the
particular value of the parameter chosen.

The average value of

$$-\frac{\partial^{2}}{\partial\theta^{2}}(\log\phi) = -\frac{\partial^{2}L}{\partial\theta^{2}} \qquad (155)$$

is thus equal to the amount of information expected.
This relation shows very simply that the amount of
information, like the likelihood function, is additive
for independent bodies of data, even if of different
sorts.
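
Both the identity between (154) and (155) and the additive property can be verified exactly for a simple case. The sketch below assumes binomial records (k successes in n trials with probability θ), an example chosen for illustration; the sample sizes 19 and 21 echo the earlier fourfold table but are otherwise arbitrary.

```python
import math

def binom_pmf(k, n, theta):
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

def expected_curvature(n, theta):
    """Average over all samples of -(d^2/dtheta^2) log phi, as in (155)."""
    return sum(
        binom_pmf(k, n, theta) * (k / theta**2 + (n - k) / (1 - theta) ** 2)
        for k in range(n + 1)
    )

def squared_score_information(n, theta):
    """S(phi'^2 / phi), the information as it appears in (154)."""
    return sum(
        binom_pmf(k, n, theta) * (k / theta - (n - k) / (1 - theta)) ** 2
        for k in range(n + 1)
    )

theta = 0.3
i19 = expected_curvature(19, theta)
i21 = expected_curvature(21, theta)
i40 = expected_curvature(40, theta)
print(i19, squared_score_information(19, theta), i19 + i21, i40)
```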

The value of the second differential coefficient of
(−L) with respect to θ is referred to as the amount of
information realized at any value of θ. It is usually
evaluated at that value for which L is maximized,
since this has the highest likelihood of being the true
value, but may be evaluated at any chosen value. At
the maximum it measures the geometrical curvature.


5. Grouping of samples 


Ideally, we should like an estimate completely to 
replace the data from which it is drawn, so that no 
distinction need be made among different hypo- 
thetical lots of data which might yield the same 
estimate. This situation is sometimes, but not 
always, realizable.

If, for any given value of θ, the probabilities of
observing the different types of samples which are to
be grouped together are

$$\phi \qquad (156)$$

such that

$$S(\phi) = \Phi. \qquad (157)$$

Then

$$S\left\{\phi\left(\frac{\phi'}{\phi}-\frac{\Phi'}{\Phi}\right)^{2}\right\} = S\left(\frac{\phi'^{2}}{\phi}\right) - \frac{\Phi'^{2}}{\Phi}, \qquad (158)$$

and since the expression on the left cannot ever be
negative, it follows that the contribution to the
amount of information is never increased by such
grouping, and that the condition that it shall not be
diminished is that, for all configurations yielding
indistinguishable estimates,

$$\phi'/\phi \qquad (159)$$

shall be constant.
If this is so for all values of the parameter it
follows that

$$\log\phi = L \qquad (160)$$

shall be the same function of the parameter, apart
from an additive constant. Samples to be grouped
together must therefore have identical likelihood 
functions, if the grouping is not to be accompanied 
by loss of information. It is a characteristic of 
Sufficient statistics that the likelihood function of the 
statistic shall be the same as that of all samples from 
which such a statistic could have been calculated.
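
The identity (158) and the condition for lossless grouping can be checked on a small case. The sketch below assumes, purely for illustration, two sample types with probabilities θ² and 2θ(1−θ) at θ = 0.3, which are then pooled; the second example uses made-up values in which φ'/φ is the same for both types.

```python
def information(phis, dphis):
    """S(phi'^2 / phi) summed over a list of sample types."""
    return sum(d * d / p for p, d in zip(phis, dphis))

def grouping_loss(phis, dphis):
    """Left-hand side of (158): the information lost by pooling the types."""
    big, dbig = sum(phis), sum(dphis)
    return sum(p * (d / p - dbig / big) ** 2 for p, d in zip(phis, dphis))

theta = 0.3
phis = [theta**2, 2 * theta * (1 - theta)]      # probabilities of the two types
dphis = [2 * theta, 2 - 4 * theta]              # their derivatives in theta

before = information(phis, dphis)
after = information([sum(phis)], [sum(dphis)])
loss = grouping_loss(phis, dphis)
print(before - after, loss)

# When phi'/phi is the same for every type, nothing is lost.
no_loss = grouping_loss([0.1, 0.2], [0.3, 0.6])
print(no_loss)
```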

The actual loss incurred in other cases may be 
calculated by determining the frequency distribution 
of the statistic to be used in terms of its parameters, 
and calculating the amount of information supplied 
by a single observation from such a distribution. 
This supplies the criterion for efficiency in finite 
samples, and so completes the second stage of the 
theory. The information recovered will in general be 
less than the amount of information in the data from 
which the estimate was calculated, the differences 
having been lost through grouping the samples of 
different kinds which lead to the same estimate. 
With estimation by the Method of Maximum Likeli- 
hood, although the likelihood functions of samples 
leading to the same estimate may be different, yet 
the method of estimation has selected for grouping 
samples having likelihood functions so far alike as to 
have their maxima all at the same parametric value, 
and therefore with stationary ratios in the most 
important region. The method will necessarily lead 
to Sufficient Statistics when these exist. 

In The Theory of Estimation (1925)¹ a good
many examples have been given showing the loss of
information for small samples of given size. When
the likelihood function is everywhere differentiable
these losses typically do not exceed the value of two
or three observations, when the method of maximum
likelihood is used in problems in which no sufficient
estimate exists.
The concept of efficiency was introduced first in 
the foregoing discussion by a definition valid only in 
the limit for large samples. Such an approach is 
incomplete and must be taken only as a first step. In 
comparing estimates, all of which tend to be dis- 
tributed Normally in the limit, a comparison of pre- 
cision is immediately available; it is the evaluation 
of the maximum attainable that leads to the concept 
of the amount of information in an observation distributed
in an error curve of any form, as is an estimate
from a finite sample. It is the amount of information,
and not the sampling variance, which completes the
criterion of efficiency for finite samples, for when the
distribution is not Normal the variance is an imperfect
measure of the precision. The property, demonstrated
above, that the amount of information may be diminished,
or conserved, but cannot be increased by the
processes of statistical reduction, guarantees the
appropriateness of completing the definition by its means.


6. Simultaneous estimation 


In most practical situations there are a number of
unknown parameters, not all of them necessarily of
interest on their own account, but required in
connection with the estimation of others. The
[...] of the parameters are chosen, principally with a view
to their being near to the final solution, so far as this
can be foreseen, but partly also for their simplicity
as a basis for calculation, and the expressions for

$$S_1 = \frac{\partial L}{\partial\theta_1}, \qquad S_2 = \frac{\partial L}{\partial\theta_2}, \qquad (161)$$

etc., are evaluated by substituting this particular set
of parametric values. These are known as the
efficient scores, and it is by their means that the trial
value may be adjusted.

For a single parameter the adjustment is, as has
been seen, effected by dividing the score by the
amount of information, that is, if δθ is the adjustment
required by a trial value θ', then

$$I\,\delta\theta = S \qquad (162)$$

will supply an adjusted value θ'+δθ which is an
Efficient estimate on the data available; for this
reason it is seldom necessary, though always possible
as a check, to repeat the calculation with an improved
trial value.
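
The adjustment (162) amounts to a single step of what is now often called the method of scoring. The sketch below assumes a hypothetical three-class model with probabilities θ², 2θ(1−θ), (1−θ)² and hypothetical frequencies, neither taken from the text; the score and the expected information are written out for that model.

```python
def score(a, theta):
    """Efficient score dL/dtheta for the assumed model, where
    L = (2*a1 + a2) log(theta) + (a2 + 2*a3) log(1 - theta) + constant."""
    a1, a2, a3 = a
    return (2 * a1 + a2) / theta - (a2 + 2 * a3) / (1 - theta)

def information(n, theta):
    """Expected information from n observations of the assumed model."""
    return 2 * n / (theta * (1 - theta))

def adjust(a, trial):
    """One application of (162): delta = S / I, added to the trial value."""
    delta = score(a, trial) / information(sum(a), trial)
    return trial + delta

a = [10, 20, 30]          # hypothetical frequencies
adjusted = adjust(a, 0.5)
print(adjusted)
```

In this assumed model a Sufficient estimate exists and a single adjustment lands on the maximum-likelihood value (2a₁+a₂)/2N from any trial value; in general the first adjustment is only approximate, which is why the text mentions repeating the calculation as a check.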

With many parameters further polishing of the
solution is more often wanted, because in some types
of data, the first appraisal is less likely to be correct.
The amount of information, which, with one parameter
is a simple scalar, becomes a symmetrical
matrix, or square table of coefficients, in the general
case, the coefficients being

$$I_{11} = S\left\{\frac{1}{\phi}\left(\frac{\partial\phi}{\partial\theta_1}\right)^{2}\right\}, \qquad I_{12} = S\left\{\frac{1}{\phi}\frac{\partial\phi}{\partial\theta_1}\frac{\partial\phi}{\partial\theta_2}\right\}, \qquad (163)$$

etc.
or, written more compactly,

$$I_{rs} = S\left\{\frac{1}{\phi}\frac{\partial\phi}{\partial\theta_r}\frac{\partial\phi}{\partial\theta_s}\right\}, \qquad (164)$$

where r, s, are suffices specifying particular parameters
θ_r, θ_s.

The series of adjustments δθ are calculated from
the linear equations of which the coefficients are
given by the information matrix, and the right hand
sides, by the Efficient Scores. In matrix form

$$I\,\delta\theta = S, \qquad (165)$$

in which now δθ and S stand for series of values
corresponding with the series of parameters.
An outstanding advantage of this form of analysis
is that the precision of the simultaneous estimate is,
for large-sample theory, given by a covariance matrix
which is the reciprocal of the information matrix.
So that if the matrix product $IV$ is equal to the
identity, then $V$ supplies the variances and covariances
of the components of the efficient estimate.
The matrix $V$ can then be calculated simply by
replacing the scores by the several series

$$\begin{matrix} 1, & 0, & 0, & \ldots \\ 0, & 1, & 0, & \ldots \\ 0, & 0, & 1, & \ldots \end{matrix} \quad \text{etc.} \tag{166}$$

and if this way is chosen the adjustments $\delta\theta$ are
obtained by direct multiplication indicated by the
matrix product

$$\delta\theta = VS .$$
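
The simultaneous adjustment can be sketched for a two-parameter case. The example below assumes a Normal sample with parameters (mean, standard deviation), for which the information matrix is diagonal with elements N/σ² and 2N/σ²; the data, trial values and helper names are hypothetical.

```python
import math


def invert_2x2(m):
    """Reciprocal of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def scores(mu, sigma, xs):
    """Efficient scores dL/dmu and dL/dsigma for a Normal sample."""
    n = len(xs)
    q = sum((x - mu) ** 2 for x in xs)
    return (sum(x - mu for x in xs) / sigma**2, -n / sigma + q / sigma**3)


def information(sigma, n):
    """Information matrix for (mu, sigma) in a Normal sample of n."""
    return ((n / sigma**2, 0.0), (0.0, 2 * n / sigma**2))


def adjust(mu, sigma, xs):
    """One simultaneous adjustment, delta = V S, with V the reciprocal of I."""
    v = invert_2x2(information(sigma, len(xs)))
    s = scores(mu, sigma, xs)
    d_mu = v[0][0] * s[0] + v[0][1] * s[1]
    d_sigma = v[1][0] * s[0] + v[1][1] * s[1]
    return mu + d_mu, sigma + d_sigma, v


xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical observations
mu, sigma = 2.0, 1.0             # rough trial values
for _ in range(30):
    mu, sigma, v = adjust(mu, sigma, xs)
# v now holds the large-sample variances of the two estimates on its diagonal.
```

Computing V once and multiplying is convenient here because the same matrix also supplies the large-sample variances and covariances.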


It is to be noted that whereas the matrix $I$ is
exact for small samples, the identification of its
reciprocal $V$ with the covariance matrix of the
simultaneous estimate is not exact; with small
samples the sampling variation of the components
will not generally be Normal, so that the covariance
matrix cannot completely specify the distribution.

With a single parameter, if $\theta$ were replaced by any
known function $\phi(\theta)$, the amount of information, in
each ingredient to be summed, is multiplied by
$(\partial\theta/\partial\phi)^2$, so that

$$I_\phi = \left(\frac{\partial\theta}{\partial\phi}\right)^2 I_\theta , \tag{167}$$

where $I_\phi$ and $I_\theta$ stand for the amounts of information
in any body of data in respect of $\phi$ and $\theta$ respectively.
Similarly, with two or more parameters, when $I$ has
been replaced by a matrix which may still be written
$I_\theta$, if $A$ is a matrix such that

$$a_{rs} = \frac{\partial\theta_r}{\partial\phi_s} , \tag{168}$$

then $I_\phi$ is given by the matrix product
TOC AAT 
$ m E " (169) 


where $A^*$ stands for the transpose of $A$, and the two
expressions are equivalent because $I_\phi$ and $I_\theta$ are
symmetrical matrices.

This transformation system for the information
matrices is exact with finite samples whereas the
corresponding transformation of the covariance
matrices could only be approximate.
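
A small numerical sketch of the transformation follows. It assumes the index convention A[r][s] = ∂θ_s/∂φ_r, under which the transformed matrix is A I Aᵀ; with the opposite convention the transpose moves to the other side. The Normal example, the change from σ to log σ and all names are assumptions made for illustration.

```python
def transform_information(a, info):
    """Return A I A^T for square matrices given as nested lists."""
    n = len(info)
    ai = [[sum(a[r][k] * info[k][s] for k in range(n)) for s in range(n)]
          for r in range(n)]
    return [[sum(ai[r][k] * a[s][k] for k in range(n)) for s in range(n)]
            for r in range(n)]


n_obs, sigma = 10, 2.0                     # hypothetical sample size and scale
info_theta = [[n_obs / sigma**2, 0.0],     # information for (mu, sigma)
              [0.0, 2 * n_obs / sigma**2]]
# New parameters (mu, lam) with lam = log(sigma), so d sigma / d lam = sigma.
a = [[1.0, 0.0],
     [0.0, sigma]]
info_phi = transform_information(a, info_theta)
```

In this example the information about log σ comes out as 2N whatever the value of σ, while that about the mean is unchanged.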
7. Ancillary information


The study of the sampling errors, that is, of the 
precision, of statistical estimates, the core of the 
Theory of Errors as developed by Gauss, has led to 
the recognition among the multitude of Consistent 
estimates which can be invented, of a smaller class, 
such that, in the important class of cases in which the 
sampling distribution tends in large samples to the 
Normal form, then the limit of the product of the 
variance and the size of sample shall be as small as 
possible. The existence of such a limit has been 
demonstrated above, and its value has been expressed 
in terms of the relations between the unknown para- 
meters and the frequencies of observations of all 
recognizable kinds. Such statistics of minimal 
limiting variances are termed Efficient, and thus 
satisfy a second rational criterion of what is required 
of a statistical estimate. It is easy to demonstrate 
that any two estimates both efficient must have a
correlation in random samples, and that their
correlation coefficient must tend to a limit +1 as the
size of the sample is increased. In fact, in the theory
of "large samples" all efficient estimates are 
equivalent. 

The theory of large samples can, however, never
be more than a first step preliminary to the study of
samples of finite size, although in fact a great many
practical problems do not need for their effectual
resolution, any further refinements. I do not think
that this is a reason for not developing those concepts
required for exact thought on small-sample problems.
In such problems the different possible Efficient
estimates must be distinguished. So far as the
choice among them is concerned, a rational criterion
is that we should prefer that estimate which conserves
the most, or loses the least of the information
supplied by the data. In practical terms, if from
samples of 10 two or more different estimates can be
calculated, we may compare their values by considering
the precision of a large sample of such
estimates each derived from a sample of only 10, and
calculate for preference that estimate which would
at this second stage give the highest precision. I do
not know of a general proof, but no exception has
been found to the rule that among Consistent
Estimates, when properly defined, that which conserves
the greatest amount of information is the
estimate of Maximum Likelihood. The unique
position of this method of estimation is also indicated
by its being the only one in which the equations of
estimation are linear in the frequencies. With small
samples this obviates the irrationality of discontinuous
changes in the estimates corresponding with
minimal changes in the data.

A realistic consideration of the problem of estima- 
tion in small samples thus points unmistakably to the 
estimate of maximum likelihood as the uniquely 
appropriate single value for use in estimation, if any 
single value (i.e. with no ancillary values) is to be
used. It indicates also that when using the maximum
likelihood estimate, some loss of information may 
occur, and, although quantitatively this loss may be 
trifling, it cannot be unimportant logically, especially 
when an exhaustive treatment of the data is required 
in the calculation of probability statements. 

The most important step which has been taken so
far to complete the structure of the theory of estimation
is the recognition of Ancillary statistics. The
notion was first developed in a detailed study of the
amount of information lost, calculated exactly for a
number of trivial but representative problems. It
was shown not only that loss of information must
vanish if all types of sample yielding the same
estimate had identical Likelihood functions, and
that for maximum likelihood estimates all must have
functions stationary at the estimated value, but
further that the loss of information suffered in the
limit of large samples, was expressible directly in
terms of the sampling variance of the second
differential coefficient

$$\frac{\partial^2 L}{\partial\theta^2}$$

and indeed is equal in the limit to this variance
multiplied by the variance of the estimate. When
the likelihood function is repeatedly differentiable
therefore, loss of information by simple estimation is
due to the variance in the amount of information
"realized". A simple remedy, appropriate for this
asymptotic situation is merely to record, not only
the estimate of maximum likelihood but the amount
of information realized, or the "apparent precision",
using this to supply a "weight" for the estimate more
precise than that supplied by the size of the sample.
Such a procedure, though merely asymptotic, and
a refinement of the large-sample theory, does
guarantee that the loss of information
in large samples shall tend to zero. The use of
further differential coefficients of the Likelihood in
the neighbourhood of the estimate can, by an
extension of the same process, reduce the limit of
$N$ times the loss of information, or of $N^2$ times its
value to zero, as $N$ tends to infinity. This is indeed
an evasion of the problem of small samples properly
speaking, but it does serve to show that it is the
Likelihood function that must supply all the material
for estimation, and that the ancillary statistics
obtained by differentiating this function are
inadequate only because they do not specify the
function fully.
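
The distinction between the information expected and the information "realized" can be sketched numerically. The example below assumes a Cauchy distribution of unit scale with unknown location, for which the expected information is 1/2 per observation; the sample is hypothetical and is chosen symmetric so that the likelihood is stationary at zero.

```python
def score(theta, xs):
    """First differential coefficient of L for a Cauchy location parameter."""
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)


def second_derivative(theta, xs):
    """Second differential coefficient of L for the same model."""
    return sum(2 * ((x - theta) ** 2 - 1) / (1 + (x - theta) ** 2) ** 2
               for x in xs)


xs = [-3.0, -1.0, 0.0, 1.0, 3.0]   # hypothetical sample, symmetric about 0
estimate = 0.0                      # stationary point of L, by symmetry
realized = -second_derivative(estimate, xs)   # information realized
expected = len(xs) / 2                        # information expected
```

For this sample the realized amount is 1.68 against an expected 2.5, so that a weight based on the size of the sample alone would overstate the precision of the estimate.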


8. The location and scale of a frequency distribution of 
known form 


If the probability of an observation falling into the
range $dx$ is given in the form

$$df = \exp\left[\phi\left(\frac{x-\alpha}{\beta}\right)\right]\frac{dx}{\beta} , \tag{170}$$

in which $\phi$ is a function of known form, which may
be taken to be differentiable almost everywhere, and
$\alpha$ and $\beta$ are two unknown parameters specifying the
location and the scale of the distribution, then the
logarithmic likelihood function is

$$L = -N\log\beta + S\left[\phi\left(\frac{x-\alpha}{\beta}\right)\right] , \tag{171}$$

where $S$ stands for summation over the $N$ values of a
sample observed. It is not to be expected that the
sum of the functions $\phi$ will be algebraically simple,
and in consequence we shall not in general be able
to express the likelihood in terms simply of the
parameters and appropriate estimates of them.
There will generally be no Sufficient pair of estimates.
The equations of maximum likelihood are easily
seen to be

$$S\left[\phi'\left(\frac{x-\alpha}{\beta}\right)\right] = 0 , \qquad S\left[\frac{x-\alpha}{\beta}\,\phi'\left(\frac{x-\alpha}{\beta}\right)\right] + N = 0 , \tag{172}$$

in which $\phi'$ stands for the first differential coefficient
of the known function $\phi$.

Let us now suppose that $A$ and $B$ are values of $\alpha$, $\beta$
satisfying the equations of estimation, and that $\phi$ is
such that these values are real and unique, then the
sample observed will have supplied a set of $N$ values
of $u$ such that

$$x = A + Bu , \tag{173}$$

and the values of $u$ satisfy the conditions

$$S(\phi'(u)) = 0 , \tag{174}$$

$$S(u\,\phi'(u)) + N = 0 . \tag{175}$$

The particular set of values $u$ satisfying these
equations, and derived from the sample observed,
may be said to specify the complexion of the sample.


It is easy to see that if for the values $x$ observed we
had instead the values

$$x' = \lambda + \mu x , \tag{176}$$

then the complexion of the sample would be unchanged.
To state the matter otherwise, the complexion
depends only on the ratios of the $N-1$
successive differences among the $N$ observations,
when these are arranged in order of magnitude. It is
evident, moreover, that the sampling variation of
these ratios is severally and jointly independent of
the parameters. The values $u$, specifying the
complexion, thus define by their differences a set of
$N-2$ functionally independent ancillary statistics,
and the precision of the values $A$ and $B$ arrived at
should be judged solely by reference to the variation
of estimates among samples having the same
complexion.
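
The invariance of the complexion under a change of location and scale can be checked directly. The sketch below represents the complexion by the N−2 ratios of the successive differences to the first; this representation, the data and the function name are illustrative assumptions, and distinct observations with a positive scale multiplier are presumed.

```python
def complexion(xs):
    """Ratios of the successive differences of the ordered sample to the first."""
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return [g / gaps[0] for g in gaps[1:]]


sample = [4.0, 1.0, 8.0, 2.0]                # hypothetical observations
shifted = [3.0 + 2.0 * x for x in sample]    # change of location and scale
```

Both `complexion(sample)` and `complexion(shifted)` give the same two ratios, so these N−2 quantities carry no information about location or scale.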

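The precision of the estimates $A$ and $B$ may be
specified by the measures of deviation

$$t_1 = \frac{A-\alpha}{\beta} , \qquad t_2 = \frac{B}{\beta} , \tag{177}$$

and since

$$\frac{x-\alpha}{\beta} = t_1 + t_2 u , \tag{178}$$

the simultaneous frequency distribution of $t_1$ and $t_2$ is

$$df \propto e^{S\phi(t_1+t_2u)}\, t_2^{N-2}\, dt_1\, dt_2 . \tag{179}$$

Moreover, the distribution of $t_2$ does not depend on
$\alpha$, being

$$df \propto \int dt_1\, e^{S\phi(t_1+t_2u)} \; t_2^{N-2}\, dt_2 , \tag{180}$$

so that the fiducial distribution of $\beta$ may be found
independently of $\alpha$, and thence that of $\alpha$ for given $\beta$,
as in the case of the Normal distribution; in fact the
simultaneous distribution of $\alpha$ and $\beta$ in the light of a
given sample is

$$df \propto \exp\left[S\left\{\phi\left(\frac{A-\alpha+Bu}{\beta}\right)\right\}\right]\frac{d\alpha\, d\beta}{\beta^{N+1}} , \tag{181}$$

for the actual set of $u$-values observed.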

It can easily be verified that the distribution found
in Chapter IV for the simultaneous frequency distribution
of the mean ($\mu$) and the standard deviation
($\sigma$) of a Normal distribution is a particular case of the
solution given above, appropriate to the case

$$\phi(v) = c - \tfrac{1}{2}v^2 ; \tag{182}$$

the existence of Sufficient estimation, in that case, is
replaced by the two estimates being in general
rendered Exhaustive by taking account of $N-2$
independent Ancillary Statistics.

It is important to recognize the nature of the 
inversion that has been effected by the fiducial 
argument in this and in analogous cases. From the 
familiar form in which we have a frequency distribu- 
tion of estimates such as A and B expressed in terms 
of the parameters, hypothetically supposed known, 
$\alpha$ and $\beta$, we have passed to a frequency distribution
of $\alpha$ and $\beta$ in a distribution specified by observable
quantities including $A$ and $B$. Since in reality $A$ and
$B$ may be calculated from the observations, and are
known in terms of them, while $\alpha$ and $\beta$ are in reality
unknown, the latter form of statement is the more 
realistic in representing the state of knowledge of a 
possible observer, while the former is a statement of 
what could be known in a hypothetical situation 
before any real observations had been made. By the 
fiducial argument we pass from the mathematical 
expression of a hypothetical to one of a realistic 
situation, in which the parameters are unknown, 
though exact probability statements can be made 
about them. Bartlett’s criticism of the fiducial 
inference as “only to be regarded as a symbolic one” 
thus seems to be an example of mistaking the substance
for the shadow. His statement that "there is
no reason to suppose that from it we may infer the
fiducial distribution of, say, $\mu + \sigma$" is presumably due
to some analytic misapprehension. The problem
involves no more than ordinary integration over the 
known bivariate frequency distribution, and its 
solution had been published before Bartlett wrote. 


9. An example of the Nile problem

An example which illustrates well the connection 
between particular mathematical relationships on the 
one hand, and the existence of ancillary statistics, by 
means of which estimation can be made exhaustive, 
and exact probability statements inferred, as their 
consequences in mathematical logic, is as follows. 

We suppose that pairs of observables (x, y) are 
distributed in a bivariate distribution 


$$df = e^{-(\theta x + y/\theta)}\, dx\, dy , \tag{183}$$

in which $x$ and $y$ take positive values only, then, if
$X$, $Y$ stand for the sums of the two coordinates over
$N$ pairs of observations, so that

$$S(x) = X , \qquad S(y) = Y , \tag{184}$$

the Likelihood of any value $\theta$ in the light of the sample
observed is

$$e^{-(\theta X + Y/\theta)} \tag{185}$$

so that its logarithm is determined by

$$L = -(\theta X + Y/\theta) . \tag{186}$$

The equation of maximum likelihood is then

$$X - Y/\theta^2 = 0 , \tag{187}$$

leading to the estimate

$$T = \sqrt{Y/X} . \tag{188}$$
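
A short numerical sketch of this estimate follows, taking the logarithmic likelihood as L = −(θX + Y/θ), where X and Y are the sums of the two coordinates. The observations and the function names are hypothetical.

```python
import math


def nile_estimate(xs, ys):
    """Return (X, Y, T): the two sums and the estimate T = sqrt(Y / X)."""
    big_x, big_y = sum(xs), sum(ys)
    return big_x, big_y, math.sqrt(big_y / big_x)


def score(theta, big_x, big_y):
    """Differential coefficient of L = -(theta X + Y / theta)."""
    return -big_x + big_y / theta**2


xs = [1.0, 2.0]    # hypothetical x-values
ys = [4.0, 8.0]    # hypothetical y-values
big_x, big_y, t = nile_estimate(xs, ys)
```

Here X = 3 and Y = 12, giving T = 2, at which value the score vanishes.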
The amount of information expected from each
pair of observations will be the mean of the square of

$$x - y/\theta^2 ; \tag{189}$$

but

$$E(x^2) = 2/\theta^2 , \qquad E(xy) = 1 , \qquad E(y^2) = 2\theta^2 , \tag{190}$$

so that

$$i_\theta = 2/\theta^2 , \tag{191}$$


or, considering the amount of information relative
to $\log\theta$,

$$i_{\log\theta} = 2 . \tag{192}$$

The amount of information supplied in a sample of
$N$ values is therefore

$$I_\theta = 2N/\theta^2 , \qquad I_{\log\theta} = 2N . \tag{193}$$

Since the likelihood cannot be expressed in terms
only of $\theta$ and $T$, there will be no Sufficient estimate,
and some information will be lost if the sample is
replaced by the estimate $T$ only. This loss of
information may be calculated from the exact
sampling distribution of $T$.


10. The sampling distribution of the estimate

Since $2\theta x$ and $2y/\theta$ are distributed independently
in exponential distributions equivalent to $\chi^2$ for 2
degrees of freedom, it follows that $2\theta X$ and $2Y/\theta$
are similarly distributed for $2N$ degrees of freedom,
or, stated otherwise, that their simultaneous distribution
is

$$df = \frac{X^{N-1}\,Y^{N-1}}{(N-1)!\,(N-1)!}\, e^{-(\theta X + Y/\theta)}\, dX\, dY . \tag{194}$$

Since we require the distribution of

$$T = \sqrt{Y/X} , \tag{195}$$


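we may substitute

$$X = U/T , \qquad Y = UT , \tag{196}$$

and obtain the simultaneous distribution of $T$ and $U$,
namely

$$df = 2\, e^{-U(\theta/T + T/\theta)}\, \frac{U^{2N-1}\, dU}{(N-1)!\,(N-1)!}\, \frac{dT}{T} . \tag{197}$$

The distribution of $T$ alone is obtained by integration
with respect to $U$ from 0 to $\infty$, giving

$$df = \frac{2\,(2N-1)!}{(N-1)!\,(N-1)!}\left(\frac{\theta}{T}+\frac{T}{\theta}\right)^{-2N}\frac{dT}{T} ; \tag{198}$$

the logarithm of the factor involving $\theta$ is

$$-2N\log\left(\frac{\theta}{T}+\frac{T}{\theta}\right) ; \tag{199}$$

its differential coefficient with respect to $\theta$ is

$$\frac{2N}{\theta}\cdot\frac{T^2-\theta^2}{T^2+\theta^2} , \tag{200}$$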


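and the mean value of the square of this, giving the
amount of information to be expected from a single
observation, $T$, is to be evaluated.
Now from (198) it appears that

$$E\left\{\left(\frac{\theta T}{\theta^2+T^2}\right)^2\right\} = \frac{N}{2(2N+1)} , \tag{201}$$

and

$$\left(\frac{T^2-\theta^2}{T^2+\theta^2}\right)^2 = 1 - 4\left(\frac{\theta T}{\theta^2+T^2}\right)^2 , \tag{202}$$

hence

$$E\left\{\left(\frac{T^2-\theta^2}{T^2+\theta^2}\right)^2\right\} = \frac{1}{2N+1} . \tag{203}$$

The amount of information is therefore

$$\frac{4N^2}{\theta^2}\cdot\frac{1}{2N+1} = \frac{2N}{\theta^2}\cdot\frac{2N}{2N+1} , \tag{204}$$

being less than that supplied by the observations by
one part in $(2N+1)$.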
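
The fraction of information retained by T can be checked by direct numerical integration. The sketch assumes the substitution τ = log T − log θ, under which the distribution of T has density proportional to sech^{2N} τ and the differential coefficient of its logarithm with respect to θ is (2N/θ) tanh τ; the substitution at this point, the integration limits and the function names are choices made for illustration.

```python
import math


def mean_tanh_squared(n, half_width=20.0, step=0.001):
    """Mean of tanh(tau)^2 when tau has density proportional to sech(tau)^(2n)."""
    steps = int(round(2 * half_width / step))
    num = den = 0.0
    for k in range(steps + 1):
        tau = -half_width + k * step
        w = 0.5 if k in (0, steps) else 1.0    # trapezoidal weights
        dens = math.cosh(tau) ** (-2 * n)
        num += w * dens * math.tanh(tau) ** 2
        den += w * dens
    return num / den


def information_in_t(n, theta):
    """Amount of information about theta carried by T alone."""
    return (4 * n**2 / theta**2) * mean_tanh_squared(n)


def total_information(n, theta):
    """Amount of information in the n pairs of observations."""
    return 2 * n / theta**2
```

For N = 3, for example, the ratio of the two amounts comes out at 6/7, that is, one part in 2N+1 is lost.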


11. The use of an ancillary statistic to recover the information
lost


The loss of information is less than half the value
of a single pair of observations, and never exceeds
one third of the total available. Nevertheless, its
recovery does exemplify very well the mathematical
processes required to complete the logical inference.

From the simultaneous distribution of $U$ and $T$,
we may find that of $U$ only, merely by integrating
(197) with respect to $T$. The integral is a
standard form for the Bessel function $K_0$, and gives
the distribution of $U$ as

$$df = 4K_0(2U)\,\frac{U^{2N-1}\, dU}{(N-1)!\,(N-1)!} . \tag{205}$$


As this distribution is independent of $\theta$, $U$ is available
as an ancillary statistic. The sampling distribution
of $T$, taking $U$ into account, is found by dividing the
bivariate element by the corresponding marginal
frequency of $U$, and is evidently

$$\frac{1}{2K_0(2U)}\, e^{-U(\theta/T + T/\theta)}\,\frac{dT}{T} . \tag{206}$$

From such an error distribution, having known $U$,
the amount of information, calculated as usual,
comes to

$$\frac{2U}{\theta^2}\cdot\frac{K_1(2U)}{K_0(2U)} , \tag{207}$$

which, it will be observed, depends upon the value of
$U$ actually available, but has an average value, when
variations of $U$ are taken into account, of

$$2N/\theta^2 , \tag{208}$$

the total amount expected on the average from $N$
observations; none is now lost.
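
The recovery of the whole of the information can be verified numerically. The sketch below evaluates the Bessel functions from the integral K_ν(z) = ∫₀^∞ e^{−z cosh t} cosh(νt) dt by the trapezoidal rule, takes the information for given U as (2U/θ²) K₁(2U)/K₀(2U), and averages it over the distribution 4K₀(2U) U^{2N−1} dU / ((N−1)!)². The integration limits, step lengths and function names are assumptions made for illustration.

```python
import math
from math import factorial


def bessel_k(order, z, upper=10.0, step=0.01):
    """K_order(z) as the integral of exp(-z cosh t) cosh(order t) over t >= 0."""
    steps = int(round(upper / step))
    total = 0.0
    for k in range(steps + 1):
        t = k * step
        w = 0.5 if k in (0, steps) else 1.0    # trapezoidal weights
        total += w * math.exp(-z * math.cosh(t)) * math.cosh(order * t)
    return total * step


def conditional_information(u, theta):
    """Information about theta in T for a given value of the ancillary U."""
    return (2 * u / theta**2) * bessel_k(1, 2 * u) / bessel_k(0, 2 * u)


def average_information(n, theta, upper=12.0, step=0.02):
    """Average of the conditional information over the distribution of U."""
    norm = factorial(n - 1) ** 2
    steps = int(round(upper / step))
    total = 0.0
    for k in range(1, steps + 1):              # the integrand vanishes at U = 0
        u = k * step
        w = 0.5 if k == steps else 1.0
        k0, k1 = bessel_k(0, 2 * u), bessel_k(1, 2 * u)
        density = 4 * k0 * u ** (2 * n - 1) / norm   # distribution of U
        info = (2 * u / theta**2) * k1 / k0          # information given U
        total += w * density * info
    return total * step
```

With N = 2 and θ = 1, `average_information(2, 1.0)` should come out close to 4, the value 2N/θ² supplied by the whole sample.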
The information is recovered and the inference
completed by replacing the distribution of $T$ for
given size of sample $N$, by the distribution of $T$ for
given $U$, which indeed happens not to involve $N$
at all. In fact, $U$ has completely replaced $N$ as a
means of specifying the precision to be ascribed to
the estimate. In both cases the estimate $T$ is the
same, the calculation of $U$ enables us to see exactly
how precise it is, not on the average, but for the
particular value of $U$ supplied by the sample.

In these circumstances it is possible to specify the
precision by an exact statement of the probability
of $\theta$ lying in any chosen range. Conveniently, if we
write

$$\tau = \log T - \log\theta , \tag{209}$$

then $\tau$ has the distribution

$$\frac{1}{2K_0(2U)}\, e^{-2U\cosh\tau}\, d\tau , \tag{210}$$

and the definite integral of this distribution between
any chosen limits, $\tau_1$ and $\tau_2$, gives the probability
that $\theta$ should lie between the corresponding limits

$$Te^{-\tau_1} \quad\text{and}\quad Te^{-\tau_2} . \tag{211}$$
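
Such a probability statement may be sketched numerically, taking τ = log T − log θ to have density proportional to e^{−2U cosh τ} and integrating by the trapezoidal rule. The values of T and U, the cutoff used for the whole range and the function names are hypothetical.

```python
import math


def tau_integral(u, lo, hi, step=0.001):
    """Integral of exp(-2U cosh tau) over [lo, hi] by the trapezoidal rule."""
    steps = max(1, int(round((hi - lo) / step)))
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.exp(-2 * u * math.cosh(lo + k * h))
    return total * h


def fiducial_probability(t, u, theta_lo, theta_hi, cutoff=15.0):
    """Probability that theta lies in (theta_lo, theta_hi), given T and U."""
    # theta = T exp(-tau), so the larger limit of theta is the smaller tau.
    tau_lo = math.log(t / theta_hi)
    tau_hi = math.log(t / theta_lo)
    whole = tau_integral(u, -cutoff, cutoff)   # numerically equal to 2 K_0(2U)
    return tau_integral(u, tau_lo, tau_hi) / whole


p = fiducial_probability(2.0, 6.0, 1.0, 4.0)       # theta between T/2 and 2T
below = fiducial_probability(2.0, 6.0, 1.0, 2.0)   # theta between T/2 and T
above = fiducial_probability(2.0, 6.0, 2.0, 4.0)   # theta between T and 2T
```

Because the distribution of τ is symmetrical, the two halves `below` and `above` are equal, and limits of the form T/k and kT always enclose T with equal probability on either side.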


It will be noticed that the distribution of $\tau$ is
symmetrical. The success of the process by which
the missing information was recovered, and the
statement of probability a posteriori rendered exact,
evidently depends on the distribution of [...]
solution of the Nile problem [...] total frequency lying
[...] hyperbolas

$$xy = c_1 , \qquad xy = c_2 , \tag{213}$$

[...] depend only on the [...] Such curves therefore
[...] to which the Nile will rise.

With a different mathematical specification of the
problem, different logical consequences might ensue.
If we take a more general distribution

$$df = \theta\phi\, e^{-(\theta x + \phi y)}\, dx\, dy , \tag{214}$$

involving two parameters $\theta$ and $\phi$, it may be seen
that with any connection between $\theta$ and $\phi$ of the


form 
$10 (215) 


it will be possible to find an ancillary statistic and to
derive probability statements about the parameter
specifying $\theta$ and $\phi$. However, the greater part of
functional relationships which might subsist between
two such positive quantities do not have this property,
and apart from approximate statements appropriate
to large samples, the totality of the information
which the data supply is subsumed in the specification
of the Likelihood function for all values of the
unknown parameter.


12. Simultaneous distribution of the parameters of a bivariate 
Normal distribution 

If from a Normal population with variances
$\sigma_1^2$, $\sigma_2^2$ and correlation $\rho$, a sample yields the
Sufficient estimates $s_1$, $s_2$ and $r$, then it was shown in
1915 that the sampling distribution of $r$ was
expressible in terms of $\rho$ only, in the frequency
element

$$\frac{(1-\rho^2)^{\frac12(N-1)}}{\pi\,(N-3)!}\,(1-r^2)^{\frac12(N-4)}\,\frac{\partial^{N-2}}{(\sin\theta\,\partial\theta)^{N-2}}\left(\frac{\theta}{\sin\theta}\right) dr , \tag{216}$$

where $\cos\theta = -\rho r$, and $0 < \theta < \pi$.


Since the distribution of $r$ does not depend on the
parameters other than $\rho$, we have a known function
of $r$ and $\rho$,

$$P(r, \rho) , \tag{217}$$

such that the distribution of $r$ for given $\rho$ is given by
the frequency element

$$\frac{\partial}{\partial r}\{P(r,\rho)\}\, dr , \tag{218}$$

and the frequency of $\rho$ for given $r$ is

$$-\frac{\partial}{\partial\rho}\{P(r,\rho)\}\, d\rho , \tag{219}$$

giving the marginal distribution of $\rho$ in terms of $r$
only. This was actually the first example of the
derivation of a fiducial distribution (1930).


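For any given values of $r$, $\rho$ the simultaneous
distribution of $s_1$ and $s_2$ is

$$\left(\frac{N-1}{1-\rho^2}\right)^{N-1}\left(\frac{s_1 s_2}{\sigma_1\sigma_2}\right)^{N-2}\exp\left\{-\frac{N-1}{2(1-\rho^2)}\left(\frac{s_1^2}{\sigma_1^2} - \frac{2\rho r s_1 s_2}{\sigma_1\sigma_2} + \frac{s_2^2}{\sigma_2^2}\right)\right\}\frac{ds_1\, ds_2}{\sigma_1\sigma_2} \tag{220}$$

divided by the function of $r\rho$ only

$$\frac{\partial^{N-2}}{\partial(-\cos\theta)^{N-2}}\left(\frac{\theta}{\sin\theta}\right) . \tag{221}$$

If we write

$$u = \sqrt{\frac{N-1}{1-\rho^2}}\,\frac{s_1}{\sigma_1} , \qquad v = \sqrt{\frac{N-1}{1-\rho^2}}\,\frac{s_2}{\sigma_2} , \tag{222}$$

the distribution for given values of $r$, $\rho$ becomes

$$\frac{(uv)^{N-2}\exp\{-\tfrac12(u^2 - 2r\rho uv + v^2)\}\, du\, dv}{\dfrac{\partial^{N-2}}{(\sin\theta\,\partial\theta)^{N-2}}\left(\dfrac{\theta}{\sin\theta}\right)} . \tag{223}$$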
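Now for any chosen values $\xi$, $\eta$, expression (223)
will supply a function $P(\xi, \eta)$ such that

$$\Pr\{u > \xi,\ v > \eta\} = P(\xi, \eta) , \tag{224}$$

or, dividing each value into $s\sqrt{(N-1)/(1-\rho^2)}$,

$$\Pr\left\{\sigma_1 < \frac{s_1}{\xi}\sqrt{\frac{N-1}{1-\rho^2}},\ \ \sigma_2 < \frac{s_2}{\eta}\sqrt{\frac{N-1}{1-\rho^2}}\right\} = P(\xi,\eta) , \tag{225}$$

giving the simultaneous fiducial distribution of $\sigma_1$
and $\sigma_2$ with the frequency element

$$\frac{\partial^2}{\partial\sigma_1\,\partial\sigma_2}P(\xi,\eta)\, d\sigma_1\, d\sigma_2
= \frac{N-1}{1-\rho^2}\,\frac{s_1 s_2}{\sigma_1^2\sigma_2^2}\,(\xi\eta)^{N-2}\exp\{-\tfrac12(\xi^2 - 2r\rho\xi\eta + \eta^2)\}\, d\sigma_1\, d\sigma_2$$

$$= (\xi\eta)^{N-1}\exp\{-\tfrac12(\xi^2 - 2r\rho\xi\eta + \eta^2)\}\,\frac{d\sigma_1\, d\sigma_2}{\sigma_1\sigma_2} , \tag{226}$$

divided by

$$\frac{\partial^{N-2}}{\partial(-\cos\theta)^{N-2}}\left(\frac{\theta}{\sin\theta}\right) , \tag{227}$$

and multiplied by the marginal frequency

$$-\frac{\partial}{\partial\rho}\{P(r,\rho)\}\, d\rho , \tag{228}$$

in which $\xi$ and $\eta$ are abbreviating symbols, standing
for

$$\sqrt{\frac{N-1}{1-\rho^2}}\,\frac{s_1}{\sigma_1} \quad\text{and}\quad \sqrt{\frac{N-1}{1-\rho^2}}\,\frac{s_2}{\sigma_2} . \tag{229}$$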


The simultaneous fiducial distribution of these
three parameters with two more for the population
means, may then be found from the consideration
that the set of statistics, $s_1$, $s_2$, $r$ is distributed
independently of the means, so that we need only a
further factor representing the fiducial distribution
of the means for known values of $\sigma_1$, $\sigma_2$ and $\rho$.
Namely,

$$\frac{N}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{N}{2(1-\rho^2)}\left\{\frac{(\bar{x}_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(\bar{x}_1-\mu_1)(\bar{x}_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(\bar{x}_2-\mu_2)^2}{\sigma_2^2}\right\}\right] d\mu_1\, d\mu_2 . \tag{230}$$
It has been proposed that any set of functions 


having distributions independent of the parameters, 
such as 


4 = VN-1 
Cy 
s N—1)(i—7? 
c a Mz, 22) (231) 


can be used to transform the simultaneous frequency
distribution of $s_1$, $s_2$, $r$ in terms of $\sigma_1$, $\sigma_2$, $\rho$, into
the simultaneous distribution of $\sigma_1$, $\sigma_2$, $\rho$ in terms
of $s_1$, $s_2$, $r$ simply by multiplying by

$$\frac{\partial(s_1, s_2, r)}{\partial(t_1, t_2, t_3)}\cdot\frac{\partial(t_1, t_2, t_3)}{\partial(\sigma_1, \sigma_2, \rho)} . \tag{232}$$

[...] a genuine fiducial
argument. The expressions (231) cannot indeed be
made to supply such an argument. Perhaps the
change of sign of $\partial t_2/\partial\rho$ at $\rho = 0$ should be a sufficient
warning.